9.1.5. Storage¶
Storage classes for storing extracted features.
- pydantic model junifer.storage.BaseFeatureStorage¶
Abstract base class for feature storage.
For every storage, one needs to provide a concrete implementation of this abstract class.
- Parameters:
- uri : pathlib.Path
  The path to the storage.
- single_output : bool, optional
  Whether to have single output (default True).
- Raises:
  AttributeError
    If the storage does not have the ``_STORAGE_TYPES`` attribute.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Show JSON schema
{ "title": "BaseFeatureStorage", "description": "Abstract base class for feature storage.\n\nFor every storage, one needs to provide a concrete\nimplementation of this abstract class.\n\nParameters\n----------\nuri : pathlib.Path\n The path to the storage.\nsingle_output : bool, optional\n Whether to have single output (default True).\n\nRaises\n------\nAttributeError\n If the storage does not have ``_STORAGE_TYPES`` attribute.", "type": "object", "properties": { "uri": { "format": "path", "title": "Uri", "type": "string" }, "single_output": { "default": true, "title": "Single Output", "type": "boolean" } }, "required": [ "uri" ] }
- Config:
frozen: bool = True
use_enum_values: bool = True
- Fields:
single_output (bool)
uri (pathlib.Path)
- abstract collect()¶
Collect data.
- abstract list_features()¶
List the features in the storage.
- model_post_init(context)¶
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- abstract read(feature_name=None, feature_md5=None)¶
Read stored feature.
- abstract read_df(feature_name=None, feature_md5=None)¶
Read feature into a pandas DataFrame.
- Parameters:
- feature_name : str, optional
  Name of the feature to read (default None).
- feature_md5 : str, optional
  MD5 hash of the feature to read (default None).
- Returns:
  pandas.DataFrame
    The features as a dataframe.
- store(kind, **kwargs)¶
Store extracted features data.
- Parameters:
- kind : StorageType
  The storage kind.
- **kwargs
  The keyword arguments.
- Raises:
  ValueError
    If ``kind`` is invalid.
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True)¶
Store matrix.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The matrix data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- matrix_kind : MatrixKind, optional
  The matrix kind (default MatrixKind.Full).
- diagonal : bool, optional
  Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).
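The interaction of ``matrix_kind`` and ``diagonal`` can be sketched in plain Python. This is only an illustration of the documented semantics (``select_entries`` is a hypothetical helper, not junifer's internal implementation):

```python
# Sketch: which entries of a square matrix are kept for each matrix kind.
# Hypothetical helper mirroring the documented store_matrix semantics.

def select_entries(matrix, matrix_kind="full", diagonal=True):
    """Return (row, col, value) triples that would be stored."""
    n = len(matrix)
    if matrix_kind == "full" and not diagonal:
        # Mirrors the documented constraint: a full matrix must keep its diagonal.
        raise ValueError("diagonal=False is invalid with matrix_kind='full'")
    kept = []
    for i in range(n):
        for j in range(n):
            if matrix_kind == "triu" and (j < i or (j == i and not diagonal)):
                continue  # below the diagonal, or on it when excluded
            if matrix_kind == "tril" and (j > i or (j == i and not diagonal)):
                continue  # above the diagonal, or on it when excluded
            kept.append((i, j, matrix[i][j]))
    return kept

m = [[1, 2], [3, 4]]
print(len(select_entries(m, "full")))                 # all 4 entries
print(len(select_entries(m, "triu")))                 # upper triangle incl. diagonal: 3
print(len(select_entries(m, "triu", diagonal=False))) # strict upper triangle: 1
```

For symmetric matrices, storing only one triangle roughly halves the stored values, which is why the triangular kinds exist.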
- abstract store_metadata(meta_md5, element, meta)¶
Store metadata.
- store_scalar_table(meta_md5, element, data, col_names=None, row_names=None, row_header_col_name='feature')¶
Store table with scalar values.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The scalar table data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- row_header_col_name : str, optional
  The column name for the row header column (default "feature").
- store_timeseries(meta_md5, element, data, col_names=None)¶
Store timeseries.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The timeseries data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- store_timeseries_2d(meta_md5, element, data, col_names=None, row_names=None)¶
Store 2D timeseries.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The 2D timeseries data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- store_vector(meta_md5, element, data, col_names=None)¶
Store vector.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray or list
  The vector data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- validate_input(input_)¶
Validate the input to the pipeline step.
- Parameters:
- input_
  The input to the pipeline step.
- Raises:
  ValueError
    If the ``input_`` is invalid.
- pydantic model junifer.storage.HDF5FeatureStorage¶
Concrete implementation for feature storage via HDF5.
- Parameters:
- uri : pathlib.Path
  The path to the file to be used.
- single_output : bool, optional
  If False, will create one HDF5 file per element. The name of the file will be prefixed with the respective element. If True, will create only one HDF5 file as specified in the uri and store all the elements in the same file. Concurrent writes should be handled with care (default True).
- overwrite : bool or "update", optional
  Whether to overwrite existing file. If True, will overwrite and if "update", will update existing entry or append (default "update").
- compression : {0-9}, optional
  Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).
- force_float32 : bool, optional
  Whether to force casting of numpy.ndarray values to float32 if float64 values are found (default True).
- chunk_size : positive int, optional
  The chunk size to use when collecting data from element files in collect(). If the file count is smaller than the value, the minimum is used (default 100).
See also
SQLiteFeatureStorage
  The concrete class for SQLite-based feature storage.
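The ``chunk_size`` behaviour described above (fall back to the file count when it is smaller than the configured value) can be sketched with a hypothetical stdlib helper; junifer's collect() internals may differ:

```python
# Sketch: group per-element files into batches of at most `chunk_size`.
# Hypothetical helper illustrating the documented chunk_size fallback.

def chunked(files, chunk_size=100):
    # Use the file count when it is below chunk_size; guard against empty input.
    size = min(chunk_size, len(files)) or 1
    return [files[i:i + size] for i in range(0, len(files), size)]

files = [f"element_{i}.hdf5" for i in range(7)]
print([len(batch) for batch in chunked(files, chunk_size=3)])  # [3, 3, 1]
print(len(chunked(files, chunk_size=100)))                     # 1 batch of all 7 files
```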
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Show JSON schema
{ "title": "HDF5FeatureStorage", "description": "Concrete implementation for feature storage via HDF5.\n\nParameters\n----------\nuri : pathlib.Path\n The path to the file to be used.\nsingle_output : bool, optional\n If False, will create one HDF5 file per element. The name\n of the file will be prefixed with the respective element.\n If True, will create only one HDF5 file as specified in the\n ``uri`` and store all the elements in the same file. Concurrent\n writes should be handled with care (default True).\noverwrite : bool or \"update\", optional\n Whether to overwrite existing file. If True, will overwrite and\n if \"update\", will update existing entry or append (default \"update\").\ncompression : {0-9}, optional\n Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).\nforce_float32 : bool, optional\n Whether to force casting of numpy.ndarray values to float32 if float64\n values are found (default True).\nchunk_size : positive int, optional\n The chunk size to use when collecting data from element files in\n :meth:`.collect`. If the file count is smaller than the value, the\n minimum is used (default 100).\n\nSee Also\n--------\nSQLiteFeatureStorage : The concrete class for SQLite-based feature storage.", "type": "object", "properties": { "uri": { "format": "path", "title": "Uri", "type": "string" }, "single_output": { "default": true, "title": "Single Output", "type": "boolean" }, "overwrite": { "anyOf": [ { "type": "boolean" }, { "type": "string" } ], "default": "update", "title": "Overwrite" }, "compression": { "default": 7, "enum": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ], "title": "Compression", "type": "integer" }, "force_float32": { "default": true, "title": "Force Float32", "type": "boolean" }, "chunk_size": { "default": 100, "exclusiveMinimum": 0, "title": "Chunk Size", "type": "integer" } }, "required": [ "uri" ] }
- Config:
frozen: bool = True
use_enum_values: bool = True
- Fields:
chunk_size (Annotated[int, annotated_types.Gt(gt=0)])
compression (Literal[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
force_float32 (bool)
overwrite (bool | str)
- collect()¶
Implement data collection.
This method globs the element files and runs a loop over them while reading metadata and then runs a loop over all the stored features in the metadata, storing the metadata and the feature data right after reading.
- Raises:
  NotImplementedError
    If ``single_output`` is True.
- list_features()¶
List the features in the storage.
- read(feature_name=None, feature_md5=None)¶
Read stored feature.
- Parameters:
- feature_name : str, optional
  Name of the feature to read (default None).
- feature_md5 : str, optional
  MD5 hash of the feature to read (default None).
- Returns:
  dict
    The stored feature as a dictionary.
- Raises:
  ValueError
    If both ``feature_md5`` and ``feature_name`` are provided, or if neither of them is provided.
  IOError
    If the HDF5 file does not exist.
  RuntimeError
    If the feature is not found or if a duplicate feature is found with the same name.
- read_df(feature_name=None, feature_md5=None)¶
Read feature into a pandas.DataFrame.
Either one of feature_name or feature_md5 needs to be specified.
- Parameters:
- feature_name : str, optional
  Name of the feature to read (default None).
- feature_md5 : str, optional
  MD5 hash of the feature to read (default None).
- Returns:
  pandas.DataFrame
    The features as a dataframe.
- Raises:
  IOError
    If the HDF5 file does not exist.
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True, row_header_col_name='ROI')¶
Store matrix.
This method performs parameter checks and then calls _store_data for storing the data.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The matrix data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- matrix_kind : MatrixKind, optional
  The matrix kind (default MatrixKind.Full).
- diagonal : bool, optional
  Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).
- row_header_col_name : str, optional
  The column name for the row header column (default "ROI").
- store_metadata(meta_md5, element, meta)¶
Store metadata.
This method first loads existing metadata (if any) using _read_metadata, appends the new metadata to it, and then saves the updated metadata using _write_processed_data. It only stores metadata if meta_md5 is not already present.
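A ``meta_md5`` key like the one used throughout this API can be illustrated by hashing a canonical JSON dump of the metadata dictionary, so identical metadata always maps to the same key. This is a sketch with stdlib tools; the exact hashing scheme junifer uses may differ:

```python
# Sketch: stable MD5 key from a metadata dict via a canonical JSON dump.
import hashlib
import json

def meta_to_md5(meta):
    # sort_keys makes the dump independent of dict insertion order
    canonical = json.dumps(meta, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

meta = {"marker": {"name": "mean"}, "type": "BOLD"}
print(meta_to_md5(meta))  # 32-character hex digest, stable across runs
```

Because the dump is canonical, two features with identical metadata collide on the same key, which is what lets store_metadata skip an already-present ``meta_md5``.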
- store_scalar_table(meta_md5, element, data, col_names=None, row_names=None, row_header_col_name='feature')¶
Store table with scalar values.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The scalar table data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- row_header_col_name : str, optional
  The column name for the row header column (default "feature").
- store_timeseries(meta_md5, element, data, col_names=None)¶
Store timeseries.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The timeseries data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- store_timeseries_2d(meta_md5, element, data, col_names=None, row_names=None)¶
Store 2D timeseries.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The 2D timeseries data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- enum junifer.storage.MatrixKind(value)¶
Accepted matrix kind value.
- Member Type:
  str
Valid values are as follows:
- UpperTriangle = <MatrixKind.UpperTriangle: 'triu'>¶
- LowerTriangle = <MatrixKind.LowerTriangle: 'tril'>¶
- Full = <MatrixKind.Full: 'full'>¶
- pydantic model junifer.storage.PandasBaseFeatureStorage¶
Abstract base class for feature storage via pandas.
For every interface that is required, one needs to provide a concrete implementation of this abstract class.
- Parameters:
- uri : str or pathlib.Path
  The path to the storage.
- single_output : bool, optional
  Whether to have single output (default True).
See also
BaseFeatureStorage
  The base class for feature storage.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Show JSON schema
{ "title": "PandasBaseFeatureStorage", "description": "Abstract base class for feature storage via pandas.\n\nFor every interface that is required, one needs to provide a concrete\nimplementation of this abstract class.\n\nParameters\n----------\nuri : str or pathlib.Path\n The path to the storage.\nsingle_output : bool, optional\n Whether to have single output (default True).\n\nSee Also\n--------\nBaseFeatureStorage : The base class for feature storage.", "type": "object", "properties": { "uri": { "format": "path", "title": "Uri", "type": "string" }, "single_output": { "default": true, "title": "Single Output", "type": "boolean" } }, "required": [ "uri" ] }
- Config:
frozen: bool = True
use_enum_values: bool = True
- static element_to_index(element, n_rows=1, rows_col_name=None)¶
Convert the element metadata to index.
- Parameters:
- element : dict
  The element as a dictionary.
- n_rows : int, optional
  Number of rows to create (default 1).
- rows_col_name : str, optional
  The column name to use in case n_rows > 1 (default None).
- Returns:
  pandas.Index or pandas.MultiIndex
    The index of the dataframe to store.
- store_df(meta_md5, element, df)¶
Implement pandas DataFrame storing.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- df : pandas.DataFrame or pandas.Series
  The pandas DataFrame or Series to store.
- Raises:
  ValueError
    If the dataframe index has items that are not in the index generated from the metadata.
- store_timeseries(meta_md5, element, data, col_names=None)¶
Store timeseries.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The timeseries data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- pydantic model junifer.storage.SQLiteFeatureStorage¶
Concrete implementation for feature storage via SQLite.
- Parameters:
- uri : str or pathlib.Path
  The path to the file to be used.
- single_output : bool, optional
  If False, will create one SQLite file per element. The name of the file will be prefixed with the respective element. If True, will create only one SQLite file as specified in the uri and store all the elements in the same file. This behaviour is only suitable for non-parallel executions. SQLite does not support concurrency (default True).
- upsert : Upsert, optional
  Upsert mode. If Upsert.Ignore is used, the existing elements are ignored. If Upsert.Update, the existing elements are updated (default Upsert.Update).
See also
PandasBaseFeatureStorage
  The base class for Pandas-based feature storage.
HDF5FeatureStorage
  The concrete class for HDF5-based feature storage.
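The two upsert modes can be sketched with stdlib sqlite3: "update" overwrites an existing row on key conflict, while "ignore" keeps the old one. This is illustrative only; SQLiteFeatureStorage itself works through an SQLAlchemy engine (see get_engine below):

```python
# Sketch of Upsert.Update vs. Upsert.Ignore semantics with stdlib sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (md5 TEXT PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO features VALUES ('abc', 1.0)")

# Upsert.Update: replace the stored value on conflict
conn.execute(
    "INSERT INTO features VALUES ('abc', 2.0) "
    "ON CONFLICT(md5) DO UPDATE SET value = excluded.value"
)
print(conn.execute("SELECT value FROM features").fetchone())  # (2.0,)

# Upsert.Ignore: keep the existing value on conflict
conn.execute(
    "INSERT INTO features VALUES ('abc', 3.0) ON CONFLICT(md5) DO NOTHING"
)
print(conn.execute("SELECT value FROM features").fetchone())  # (2.0,)
```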
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Show JSON schema
{ "title": "SQLiteFeatureStorage", "description": "Concrete implementation for feature storage via SQLite.\n\nParameters\n----------\nuri : str or pathlib.Path\n The path to the file to be used.\nsingle_output : bool, optional\n If False, will create one SQLite file per element. The name\n of the file will be prefixed with the respective element.\n If True, will create only one SQLite file as specified in the\n ``uri`` and store all the elements in the same file. This behaviour\n is only suitable for non-parallel executions. SQLite does not\n support concurrency (default True).\nupsert : :enum:`.Upsert`, optional\n Upsert mode. If ``Upsert.Ignore`` is used, the existing elements are\n ignored. If ``Upsert.Update``, the existing elements are updated\n (default ``Upsert.Update``).\n\nSee Also\n--------\nPandasBaseFeatureStorage : The base class for Pandas-based feature storage.\nHDF5FeatureStorage : The concrete class for HDF5-based feature storage.", "type": "object", "properties": { "uri": { "format": "path", "title": "Uri", "type": "string" }, "single_output": { "default": true, "title": "Single Output", "type": "boolean" }, "upsert": { "$ref": "#/$defs/Upsert", "default": "update" } }, "$defs": { "Upsert": { "description": "Accepted upsert value.", "enum": [ "update", "ignore" ], "title": "Upsert", "type": "string" } }, "required": [ "uri" ] }
- Config:
frozen: bool = True
use_enum_values: bool = True
- Fields:
upsert (junifer.storage.sqlite.Upsert)
- collect()¶
Implement data collection.
- Raises:
  NotImplementedError
    If ``single_output`` is True.
- get_engine(element=None)¶
Get engine.
- Parameters:
- element : dict, optional
  The element as a dictionary (default None).
- Returns:
  sqlalchemy.engine.Engine
    The sqlalchemy engine.
- Raises:
  ValueError
    If ``element=None`` when ``single_output=False``.
- list_features()¶
List the features in the storage.
- read(feature_name=None, feature_md5=None)¶
Read stored feature.
- Parameters:
- feature_name : str, optional
  Name of the feature to read (default None).
- feature_md5 : str, optional
  MD5 hash of the feature to read (default None).
- Returns:
  dict
    The stored feature as a dictionary.
- Raises:
  ValueError
    If parameter values are invalid, or the feature is not found, or multiple features are found.
- read_df(feature_name=None, feature_md5=None)¶
Implement feature reading into a pandas DataFrame.
Either one of feature_name or feature_md5 needs to be specified.
- Parameters:
- feature_name : str, optional
  Name of the feature to read (default None).
- feature_md5 : str, optional
  MD5 hash of the feature to read (default None).
- Returns:
  pandas.DataFrame
    The features as a dataframe.
- Raises:
  ValueError
    If parameter values are invalid, or the feature is not found, or multiple features are found.
- store_df(meta_md5, element, df)¶
Implement pandas DataFrame storing.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- df : pandas.DataFrame or pandas.Series
  The pandas DataFrame or Series to store.
- Raises:
  ValueError
    If the dataframe index has items that are not in the index generated from the metadata.
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind=MatrixKind.Full, diagonal=True)¶
Store matrix.
- Parameters:
- meta_md5 : str
  The metadata MD5 hash.
- element : dict
  The element as a dictionary.
- data : numpy.ndarray
  The matrix data to store.
- col_names : list-like of str, optional
  The column labels (default None).
- row_names : list-like of str, optional
  The row labels (default None).
- matrix_kind : MatrixKind, optional
  The matrix kind (default MatrixKind.Full).
- diagonal : bool, optional
  Whether to store the diagonal. If matrix_kind=MatrixKind.Full, setting this to False will raise an error (default True).
- enum junifer.storage.StorageType(value)¶
Accepted storage type.
- Member Type:
  str
Valid values are as follows:
- Vector = <StorageType.Vector: 'vector'>¶
- Matrix = <StorageType.Matrix: 'matrix'>¶
- Timeseries = <StorageType.Timeseries: 'timeseries'>¶
- Timeseries2D = <StorageType.Timeseries2D: 'timeseries_2d'>¶
- ScalarTable = <StorageType.ScalarTable: 'scalar_table'>¶
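A str-valued Enum like StorageType also doubles as a validator: constructing it from an unknown value raises ValueError, which is how an invalid ``kind`` passed to store() can be rejected. A standalone sketch mirroring the members listed above (junifer defines the real StorageType):

```python
# Sketch: a str-mixin Enum mirroring StorageType, used to validate a kind.
from enum import Enum

class StorageType(str, Enum):
    Vector = "vector"
    Matrix = "matrix"
    Timeseries = "timeseries"
    Timeseries2D = "timeseries_2d"
    ScalarTable = "scalar_table"

print(StorageType("matrix").value)  # matrix
try:
    StorageType("tensor")  # not a valid storage kind
except ValueError as err:
    print(err)
```

The str mixin means members compare equal to their raw string values, so existing code that passes plain strings keeps working.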