9.1.5. Storage

Storage interfaces for storing extracted features.

pydantic model junifer.storage.BaseFeatureStorage

Abstract base class for feature storage.

For every storage, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : pathlib.Path

The path to the storage.

single_output : bool, optional

Whether to have single output (default True).

Raises:
AttributeError

If the storage does not have _STORAGE_TYPES attribute.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.
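Providing a concrete implementation means supplying collect(), list_features(), read(), read_df(), and store_metadata(), plus the relevant store_* helpers. The contract can be illustrated without junifer installed by a hypothetical in-memory analogue (the class below is a simplified sketch, not part of the package):

```python
class InMemoryFeatureStorage:
    """Hypothetical in-memory analogue of the feature storage contract."""

    def __init__(self, uri, single_output=True):
        self.uri = uri
        self.single_output = single_output
        self._metadata = {}  # meta_md5 -> metadata dict
        self._data = {}      # meta_md5 -> stored feature payload

    def store_metadata(self, meta_md5, element, meta):
        # Store metadata only once per MD5.
        self._metadata.setdefault(meta_md5, meta)

    def store_vector(self, meta_md5, element, data, col_names=None):
        self._data[meta_md5] = {
            "element": element, "data": data, "col_names": col_names
        }

    def list_features(self):
        # Keys are the feature MD5s, values the metadata of each feature.
        return dict(self._metadata)

    def read(self, feature_name=None, feature_md5=None):
        # Exactly one of the two lookup keys must be given.
        if (feature_name is None) == (feature_md5 is None):
            raise ValueError(
                "Specify exactly one of feature_name or feature_md5"
            )
        if feature_md5 is None:
            matches = [
                md5 for md5, meta in self._metadata.items()
                if meta.get("name") == feature_name
            ]
            if len(matches) != 1:
                raise RuntimeError("Feature not found or duplicated")
            feature_md5 = matches[0]
        return self._data[feature_md5]
```

The real base class additionally validates inputs and dispatches store() to the type-specific store_* methods; this sketch only mirrors the shape of the interface.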

JSON schema:
{
   "title": "BaseFeatureStorage",
   "description": "Abstract base class for feature storage.\n\nFor every storage, one needs to provide a concrete\nimplementation of this abstract class.\n\nParameters\n----------\nuri : pathlib.Path\n    The path to the storage.\nsingle_output : bool, optional\n    Whether to have single output (default True).\n\nRaises\n------\nAttributeError\n    If the storage does not have ``_STORAGE_TYPES`` attribute.",
   "type": "object",
   "properties": {
      "uri": {
         "format": "path",
         "title": "Uri",
         "type": "string"
      },
      "single_output": {
         "default": true,
         "title": "Single Output",
         "type": "boolean"
      }
   },
   "required": [
      "uri"
   ]
}

Config:
  • frozen: bool = True

  • use_enum_values: bool = True

Fields:
  • single_output (bool)

  • uri (pathlib.Path)

field single_output: bool = True
field uri: Path [Required]
abstract collect()

Collect data.

abstract list_features()

List the features in the storage.

Returns:
dict

The features in the storage as a dictionary. The keys are the feature MD5s, to be used in read_df(), and the values are the metadata of each feature.
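The MD5 keys are derived from each feature's metadata. The exact hashing routine is internal to junifer, but the idea — a deterministic digest of the metadata dictionary — can be sketched with the standard library (illustrative only, not junifer's actual implementation):

```python
import hashlib
import json


def meta_to_md5(meta: dict) -> str:
    """Hash a metadata dict into a stable hex key (illustrative only)."""
    # sort_keys makes the serialization, and hence the hash, deterministic
    # regardless of dict insertion order.
    serialized = json.dumps(meta, sort_keys=True)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

The same metadata always maps to the same 32-character key, which is what makes the MD5 usable as a stable feature identifier across runs.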

model_post_init(context)

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

abstract read(feature_name=None, feature_md5=None)

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

abstract read_df(feature_name=None, feature_md5=None)

Read feature into a pandas DataFrame.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

store(kind, **kwargs)

Store extracted features data.

Parameters:
kind : StorageType

The storage kind.

**kwargs

The keyword arguments.

Raises:
ValueError

If kind is invalid.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True)

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

matrix_kind : MatrixKind, optional

The matrix kind (default MatrixKind.Full).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).

abstract store_metadata(meta_md5, element, meta)

Store metadata.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_scalar_table(meta_md5, element, data, col_names=None, row_names=None, row_header_col_name='feature')

Store table with scalar values.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The scalar table data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

row_header_col_name : str, optional

The column name for the row header column (default "feature").

store_timeseries(meta_md5, element, data, col_names=None)

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list-like of str, optional

The column labels (default None).

store_timeseries_2d(meta_md5, element, data, col_names=None, row_names=None)

Store 2D timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The 2D timeseries data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

store_vector(meta_md5, element, data, col_names=None)

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list-like of str, optional

The column labels (default None).

validate_input(input_)

Validate the input to the pipeline step.

Parameters:
input_ : list of str

The input to the pipeline step.

Raises:
ValueError

If the input_ is invalid.

pydantic model junifer.storage.HDF5FeatureStorage

Concrete implementation for feature storage via HDF5.

Parameters:
uri : pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one HDF5 file per element. The name of the file will be prefixed with the respective element. If True, will create only one HDF5 file as specified in the uri and store all the elements in the same file. Concurrent writes should be handled with care (default True).

overwrite : bool or "update", optional

Whether to overwrite existing file. If True, will overwrite and if "update", will update existing entry or append (default "update").

compression : {0-9}, optional

Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).

force_float32 : bool, optional

Whether to force casting of numpy.ndarray values to float32 if float64 values are found (default True).

chunk_size : positive int, optional

The chunk size to use when collecting data from element files in collect(). If the file count is smaller than the value, the minimum is used (default 100).

See also

SQLiteFeatureStorage

The concrete class for SQLite-based feature storage.
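The chunk_size parameter batches element files during collect(). As an illustration of the described behaviour (a sketch, not junifer's actual implementation), chunking a list of files might look like:

```python
def iter_chunks(files, chunk_size=100):
    """Yield successive batches of at most ``chunk_size`` files."""
    # If there are fewer files than chunk_size, a single smaller batch
    # is yielded, matching the documented "minimum is used" behaviour.
    for start in range(0, len(files), chunk_size):
        yield files[start:start + chunk_size]
```

Processing element files chunk by chunk bounds memory use when collecting many per-element HDF5 files into the single output file.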

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "HDF5FeatureStorage",
   "description": "Concrete implementation for feature storage via HDF5.\n\nParameters\n----------\nuri : pathlib.Path\n    The path to the file to be used.\nsingle_output : bool, optional\n    If False, will create one HDF5 file per element. The name\n    of the file will be prefixed with the respective element.\n    If True, will create only one HDF5 file as specified in the\n    ``uri`` and store all the elements in the same file. Concurrent\n    writes should be handled with care (default True).\noverwrite : bool or \"update\", optional\n    Whether to overwrite existing file. If True, will overwrite and\n    if \"update\", will update existing entry or append (default \"update\").\ncompression : {0-9}, optional\n    Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).\nforce_float32 : bool, optional\n    Whether to force casting of numpy.ndarray values to float32 if float64\n    values are found (default True).\nchunk_size : positive int, optional\n    The chunk size to use when collecting data from element files in\n    :meth:`.collect`. If the file count is smaller than the value, the\n    minimum is used (default 100).\n\nSee Also\n--------\nSQLiteFeatureStorage : The concrete class for SQLite-based feature storage.",
   "type": "object",
   "properties": {
      "uri": {
         "format": "path",
         "title": "Uri",
         "type": "string"
      },
      "single_output": {
         "default": true,
         "title": "Single Output",
         "type": "boolean"
      },
      "overwrite": {
         "anyOf": [
            {
               "type": "boolean"
            },
            {
               "type": "string"
            }
         ],
         "default": "update",
         "title": "Overwrite"
      },
      "compression": {
         "default": 7,
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "Compression",
         "type": "integer"
      },
      "force_float32": {
         "default": true,
         "title": "Force Float32",
         "type": "boolean"
      },
      "chunk_size": {
         "default": 100,
         "exclusiveMinimum": 0,
         "title": "Chunk Size",
         "type": "integer"
      }
   },
   "required": [
      "uri"
   ]
}

Config:
  • frozen: bool = True

  • use_enum_values: bool = True

Fields:
  • chunk_size (Annotated[int, annotated_types.Gt(gt=0)])

  • compression (Literal[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

  • force_float32 (bool)

  • overwrite (bool | str)

field chunk_size: Annotated[int, Gt(gt=0)] = 100
Constraints:
  • gt = 0

field compression: Literal[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] = 7
field force_float32: bool = True
field overwrite: bool | str = 'update'
collect()

Implement data collection.

This method globs the element files and loops over them, reading the metadata of each; it then loops over all the features stored in that metadata, storing both the metadata and the feature data right after reading.

Raises:
NotImplementedError

If single_output is True.

list_features()

List the features in the storage.

Returns:
dict

The features in the storage as a dictionary. The keys are the feature MD5s, to be used in read_df(), and the values are the metadata of each feature.

read(feature_name=None, feature_md5=None)

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

Raises:
ValueError

If both feature_md5 and feature_name are provided, or if neither is provided.

IOError

If HDF5 file does not exist.

RuntimeError

If the feature is not found, or if duplicate features with the same name are found.

read_df(feature_name=None, feature_md5=None)

Read feature into a pandas.DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
IOError

If HDF5 file does not exist.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True, row_header_col_name='ROI')

Store matrix.

This method performs parameter checks and then calls _store_data for storing the data.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

matrix_kind : MatrixKind, optional

The matrix kind (default MatrixKind.Full).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).

row_header_col_name : str, optional

The column name for the row header column (default "ROI").
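The matrix_kind and diagonal parameters control which entries of the matrix are kept. Assuming MatrixKind.UpperTriangle keeps the entries with row <= col (row < col when diagonal=False), the selection can be reproduced with NumPy:

```python
import numpy as np

data = np.arange(9).reshape(3, 3)

# Upper triangle including the diagonal, i.e. the entries a
# MatrixKind.UpperTriangle store with diagonal=True would keep:
# indices where row <= col.
r, c = np.triu_indices_from(data, k=0)
upper_with_diag = data[r, c]

# Upper triangle excluding the diagonal (diagonal=False): row < col.
r, c = np.triu_indices_from(data, k=1)
upper_no_diag = data[r, c]
```

For a symmetric matrix (e.g. a functional connectivity matrix) storing only one triangle halves the storage without losing information, which is the typical use of the non-Full kinds.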

store_metadata(meta_md5, element, meta)

Store metadata.

This method first loads existing metadata (if any) using _read_metadata, appends the new metadata to it, and then saves the updated metadata using _write_processed_data. Metadata is stored only if meta_md5 is not already present.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_scalar_table(meta_md5, element, data, col_names=None, row_names=None, row_header_col_name='feature')

Store table with scalar values.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The scalar table data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

row_header_col_name : str, optional

The column name for the row header column (default "feature").

store_timeseries(meta_md5, element, data, col_names=None)

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list-like of str, optional

The column labels (default None).

store_timeseries_2d(meta_md5, element, data, col_names=None, row_names=None)

Store 2D timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The 2D timeseries data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

store_vector(meta_md5, element, data, col_names=None)

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list-like of str, optional

The column labels (default None).

enum junifer.storage.MatrixKind(value)

Accepted matrix kind value.

Member Type:

str

Valid values are as follows:

UpperTriangle = <MatrixKind.UpperTriangle: 'triu'>
LowerTriangle = <MatrixKind.LowerTriangle: 'tril'>
Full = <MatrixKind.Full: 'full'>
pydantic model junifer.storage.PandasBaseFeatureStorage

Abstract base class for feature storage via pandas.

For every interface that is required, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : str or pathlib.Path

The path to the storage.

single_output : bool, optional

Whether to have single output (default True).

See also

BaseFeatureStorage

The base class for feature storage.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "PandasBaseFeatureStorage",
   "description": "Abstract base class for feature storage via pandas.\n\nFor every interface that is required, one needs to provide a concrete\nimplementation of this abstract class.\n\nParameters\n----------\nuri : str or pathlib.Path\n    The path to the storage.\nsingle_output : bool, optional\n    Whether to have single output (default True).\n\nSee Also\n--------\nBaseFeatureStorage : The base class for feature storage.",
   "type": "object",
   "properties": {
      "uri": {
         "format": "path",
         "title": "Uri",
         "type": "string"
      },
      "single_output": {
         "default": true,
         "title": "Single Output",
         "type": "boolean"
      }
   },
   "required": [
      "uri"
   ]
}

Config:
  • frozen: bool = True

  • use_enum_values: bool = True

Fields:

static element_to_index(element, n_rows=1, rows_col_name=None)

Convert the element metadata to index.

Parameters:
element : dict

The element as a dictionary.

n_rows : int, optional

Number of rows to create (default 1).

rows_col_name : str, optional

The column name to use in case n_rows > 1. If None and n_rows > 1, the name will be "idx" (default None).

Returns:
pandas.Index or pandas.MultiIndex

The index of the dataframe to store.
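A sketch of this conversion using pandas directly (illustrative only; the real method may differ in detail, e.g. it can return a plain pandas.Index for simple elements):

```python
import pandas as pd


def element_to_index(element, n_rows=1, rows_col_name=None):
    """Sketch: convert element metadata into a pandas MultiIndex."""
    if n_rows == 1:
        # One row: one tuple of the element's values, named by its keys.
        return pd.MultiIndex.from_tuples(
            [tuple(element.values())], names=list(element)
        )
    # Multiple rows: repeat the element values and append a row counter
    # under rows_col_name, falling back to "idx" as documented.
    rows_col_name = rows_col_name or "idx"
    return pd.MultiIndex.from_product(
        [[v] for v in element.values()] + [range(n_rows)],
        names=list(element) + [rows_col_name],
    )
```

For element {"subject": "sub-01"} with n_rows=3 this yields a two-level index named ("subject", "idx") with three rows, which is the shape a timeseries dataframe for that element would need.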

store_df(meta_md5, element, df)

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.

store_timeseries(meta_md5, element, data, col_names=None)

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list-like of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list-like of str, optional

The column labels (default None).

pydantic model junifer.storage.SQLiteFeatureStorage

Concrete implementation for feature storage via SQLite.

Parameters:
uri : str or pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one SQLite file per element. The name of the file will be prefixed with the respective element. If True, will create only one SQLite file as specified in the uri and store all the elements in the same file. This behaviour is only suitable for non-parallel executions. SQLite does not support concurrency (default True).

upsert : Upsert, optional

Upsert mode. If Upsert.Ignore is used, the existing elements are ignored. If Upsert.Update, the existing elements are updated (default Upsert.Update).

See also

PandasBaseFeatureStorage

The base class for Pandas-based feature storage.

HDF5FeatureStorage

The concrete class for HDF5-based feature storage.
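The two Upsert modes correspond to SQLite's ON CONFLICT clauses. Their semantics can be demonstrated with the standard library's sqlite3 module (junifer itself goes through pandas/SQLAlchemy; this only illustrates the behaviour):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (md5 TEXT PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO features VALUES ('abc', 1.0)")

# Upsert.Update semantics: the existing row is overwritten.
conn.execute(
    "INSERT INTO features VALUES ('abc', 2.0) "
    "ON CONFLICT(md5) DO UPDATE SET value = excluded.value"
)

# Upsert.Ignore semantics: the existing row is left untouched.
conn.execute(
    "INSERT INTO features VALUES ('abc', 3.0) "
    "ON CONFLICT(md5) DO NOTHING"
)

value = conn.execute(
    "SELECT value FROM features WHERE md5 = 'abc'"
).fetchone()[0]
```

After the two inserts, value is 2.0: the update overwrote the original 1.0, and the ignored insert left the row alone.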

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

JSON schema:
{
   "title": "SQLiteFeatureStorage",
   "description": "Concrete implementation for feature storage via SQLite.\n\nParameters\n----------\nuri : str or pathlib.Path\n    The path to the file to be used.\nsingle_output : bool, optional\n    If False, will create one SQLite file per element. The name\n    of the file will be prefixed with the respective element.\n    If True, will create only one SQLite file as specified in the\n    ``uri`` and store all the elements in the same file. This behaviour\n    is only suitable for non-parallel executions. SQLite does not\n    support concurrency (default True).\nupsert : :enum:`.Upsert`, optional\n    Upsert mode. If ``Upsert.Ignore`` is used, the existing elements are\n    ignored. If ``Upsert.Update``, the existing elements are updated\n    (default ``Upsert.Update``).\n\nSee Also\n--------\nPandasBaseFeatureStorage : The base class for Pandas-based feature storage.\nHDF5FeatureStorage : The concrete class for HDF5-based feature storage.",
   "type": "object",
   "properties": {
      "uri": {
         "format": "path",
         "title": "Uri",
         "type": "string"
      },
      "single_output": {
         "default": true,
         "title": "Single Output",
         "type": "boolean"
      },
      "upsert": {
         "$ref": "#/$defs/Upsert",
         "default": "update"
      }
   },
   "$defs": {
      "Upsert": {
         "description": "Accepted upsert value.",
         "enum": [
            "update",
            "ignore"
         ],
         "title": "Upsert",
         "type": "string"
      }
   },
   "required": [
      "uri"
   ]
}

Config:
  • frozen: bool = True

  • use_enum_values: bool = True

Fields:
  • upsert (junifer.storage.sqlite.Upsert)

field upsert: Upsert = Upsert.Update
collect()

Implement data collection.

Raises:
NotImplementedError

If single_output is True.

get_engine(element=None)

Get engine.

Parameters:
element : dict, optional

The element as a dictionary (default None).

Returns:
sqlalchemy.engine.Engine

The sqlalchemy engine.

Raises:
ValueError

If element=None when single_output=False.

list_features()

List the features in the storage.

Returns:
dict

The features in the storage as a dictionary. The keys are the feature MD5s, to be used in read_df(), and the values are the metadata of each feature.

read(feature_name=None, feature_md5=None)

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

Raises:
NotImplementedError
read_df(feature_name=None, feature_md5=None)

Implement feature reading into a pandas DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
ValueError

If the parameter values are invalid, the feature is not found, or multiple features are found.

store_df(meta_md5, element, df)

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True)

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list-like of str, optional

The column labels (default None).

row_names : list-like of str, optional

The row labels (default None).

matrix_kind : MatrixKind, optional

The matrix kind (default MatrixKind.Full).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).

store_metadata(meta_md5, element, meta)

Implement metadata storing in the storage.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

enum junifer.storage.StorageType(value)

Accepted storage type.

Member Type:

str

Valid values are as follows:

Vector = <StorageType.Vector: 'vector'>
Matrix = <StorageType.Matrix: 'matrix'>
Timeseries = <StorageType.Timeseries: 'timeseries'>
Timeseries2D = <StorageType.Timeseries2D: 'timeseries_2d'>
ScalarTable = <StorageType.ScalarTable: 'scalar_table'>
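store() validates its kind argument against these values and raises ValueError otherwise. That check can be sketched stand-alone with a stdlib str-based Enum mirroring the documented values (the validate_kind helper is hypothetical, not part of junifer):

```python
from enum import Enum


class StorageType(str, Enum):
    """Mirror of the documented junifer.storage.StorageType values."""
    Vector = "vector"
    Matrix = "matrix"
    Timeseries = "timeseries"
    Timeseries2D = "timeseries_2d"
    ScalarTable = "scalar_table"


def validate_kind(kind: str) -> StorageType:
    """Hypothetical helper: map a kind string to the enum or raise."""
    try:
        # Calling a str-based Enum with a value looks the member up.
        return StorageType(kind)
    except ValueError:
        raise ValueError(f"Invalid storage kind: {kind!r}") from None
```

Because the member type is str, enum members compare equal to their plain string values, which is why "use_enum_values" configs and string-based YAML pipelines interoperate with them cleanly.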
enum junifer.storage.Upsert(value)

Accepted upsert value.

Member Type:

str

Valid values are as follows:

Update = <Upsert.Update: 'update'>
Ignore = <Upsert.Ignore: 'ignore'>