9.1.5. Storage
Provide imports for storage sub-package.
- class junifer.storage.BaseFeatureStorage(uri, storage_types, single_output=True)
Abstract base class for feature storage.
For every interface that is required, one needs to provide a concrete implementation of this abstract class.
- Parameters:
- abstract collect()
Collect data.
- get_valid_inputs()
Get valid storage types for input.
- abstract list_features()
List the features in the storage.
- abstract read_df(feature_name=None, feature_md5=None)
Read feature into a pandas DataFrame.
- Parameters:
- Returns:
pandas.DataFrame
The features as a dataframe.
- store(kind, **kwargs)
Store extracted features data.
- Parameters:
- kind : {“matrix”, “timeseries”, “vector”}
The storage kind.
- **kwargs
The keyword arguments.
- Raises:
ValueError
If kind is invalid.
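The dispatch described for store() can be sketched as follows. This is a minimal illustration, not junifer's implementation: ToyStorage and the return values of its store_* methods are hypothetical stand-ins, and only the three documented kinds and the ValueError come from the text above.

```python
class ToyStorage:
    """Hypothetical stand-in; this is not junifer's BaseFeatureStorage."""

    def store(self, kind, **kwargs):
        # Map each documented kind to the matching store_* method.
        handlers = {
            "matrix": self.store_matrix,
            "timeseries": self.store_timeseries,
            "vector": self.store_vector,
        }
        if kind not in handlers:
            raise ValueError(f"Invalid kind: {kind!r}")
        # Forward the keyword arguments unchanged to the chosen method.
        return handlers[kind](**kwargs)

    # The methods below only report what they received (assumption for
    # illustration); a real storage would write the data somewhere.
    def store_matrix(self, **kwargs):
        return ("matrix", sorted(kwargs))

    def store_timeseries(self, **kwargs):
        return ("timeseries", sorted(kwargs))

    def store_vector(self, **kwargs):
        return ("vector", sorted(kwargs))


storage = ToyStorage()
result = storage.store("vector", data=[1.0, 2.0], col_names=["a", "b"])
```

Keeping the mapping in one place means an unknown kind fails early, before any data is touched.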
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)
Store matrix.
- Parameters:
- meta_md5 : str
The metadata MD5 hash.
- element : dict
The element as a dictionary.
- data : numpy.ndarray
The matrix data to store.
- col_names : list or tuple of str, optional
The column labels (default None).
- row_names : str, optional
The row labels (default None).
- matrix_kind : str, optional
The kind of matrix (default “full”):
“triu”: store upper triangular only
“tril”: store lower triangular
“full”: full matrix
- diagonal : bool, optional
Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).
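To make the interaction of matrix_kind and diagonal concrete, here is a minimal sketch of which entries each combination keeps. It assumes a square matrix given as nested lists. The function select_entries is a hypothetical name, and the sketch does not show how junifer lays out the stored values.

```python
def select_entries(data, matrix_kind="full", diagonal=True):
    """Return the values kept for a matrix kind, in row-major order."""
    if matrix_kind == "full" and not diagonal:
        # Documented behaviour: a full matrix must keep its diagonal.
        raise ValueError("diagonal=False is not allowed for matrix_kind='full'")
    kept = []
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if i == j and not diagonal:
                continue  # drop the diagonal when requested
            if matrix_kind == "triu" and j < i:
                continue  # below the diagonal: not part of the upper triangle
            if matrix_kind == "tril" and j > i:
                continue  # above the diagonal: not part of the lower triangle
            kept.append(value)
    return kept


matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
upper_with_diag = select_entries(matrix, "triu")                 # [1, 2, 3, 5, 6, 9]
upper_no_diag = select_entries(matrix, "triu", diagonal=False)   # [2, 3, 6]
lower_with_diag = select_entries(matrix, "tril")                 # [1, 4, 5, 7, 8, 9]
```

For a symmetric matrix, storing only one triangle roughly halves the number of stored values without losing information.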
- abstract store_metadata(meta_md5, element, meta)
Store metadata.
- store_timeseries(meta_md5, element, data, col_names=None)
Store timeseries.
- store_vector(meta_md5, element, data, col_names=None)
Store vector.
- validate(input_)
Validate the input to the pipeline step.
- Parameters:
- Raises:
ValueError
If input_ is invalid.
- class junifer.storage.HDF5FeatureStorage(uri, single_output=True, overwrite='update', compression=7, force_float32=True, chunk_size=100)
Concrete implementation for feature storage via HDF5.
- Parameters:
- uri : str or pathlib.Path
The path to the file to be used.
- single_output : bool, optional
If False, will create one HDF5 file per element. The name of the file will be prefixed with the respective element. If True, will create only one HDF5 file as specified in the uri and store all the elements in the same file. Concurrent writes should be handled with care (default True).
- overwrite : bool or “update”, optional
Whether to overwrite the existing file. If True, will overwrite and if “update”, will update existing entry or append (default “update”).
- compression : {0-9}, optional
Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).
- force_float32 : bool, optional
Whether to force casting of numpy.ndarray values to float32 if float64 values are found (default True).
- chunk_size : int, optional
The chunk size to use when collecting data from element files in collect(). If the file count is smaller than the value, the minimum is used (default 100).
See also
SQLiteFeatureStorage
The concrete class for SQLite-based feature storage.
- collect()
Implement data collection.
This method globs the element files and loops over them, reading the metadata of each file. For each file, it then loops over all the features stored in the metadata, storing the metadata and the feature data right after reading them.
- Raises:
NotImplementedError
If single_output is True.
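The interplay between globbing element files and chunk_size can be sketched as follows. This is only an illustration of the documented rule that the smaller of chunk_size and the file count is used. The file naming pattern and the helper iter_chunks are assumptions, and empty placeholder files stand in for real HDF5 files.

```python
import tempfile
from pathlib import Path


def iter_chunks(files, chunk_size=100):
    """Yield lists of at most chunk_size files."""
    # If there are fewer files than chunk_size, the file count is used.
    size = max(1, min(chunk_size, len(files)))
    for start in range(0, len(files), size):
        yield files[start:start + size]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical per-element files; the naming scheme is an assumption.
    for name in ("sub-01", "sub-02", "sub-03"):
        (Path(tmp) / f"element_{name}_features.hdf5").touch()

    # Glob the element files, then process them chunk by chunk.
    element_files = sorted(Path(tmp).glob("element_*_features.hdf5"))
    chunks = [
        [path.name for path in chunk]
        for chunk in iter_chunks(element_files, chunk_size=2)
    ]
```

With three files and a chunk size of 2, the loop sees one chunk of two files followed by one chunk of a single file.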
- get_valid_inputs()
Get valid storage types for input.
- list_features()
List the features in the storage.
- read_df(feature_name=None, feature_md5=None)
Read feature into a pandas.DataFrame.
Either feature_name or feature_md5 needs to be specified.
- Parameters:
- Returns:
pandas.DataFrame
The features as a dataframe.
- Raises:
IOError
If HDF5 file does not exist.
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True, row_header_col_name='ROI')
Store matrix.
This method performs parameter checks and then calls _store_data for storing the data.
- Parameters:
- meta_md5 : str
The metadata MD5 hash.
- element : dict
The element as a dictionary.
- data : numpy.ndarray
The matrix data to store.
- col_names : list or tuple of str, optional
The column labels (default None).
- row_names : list or tuple of str, optional
The row labels (default None).
- matrix_kind : str, optional
The kind of matrix (default “full”):
“triu”: store upper triangular only
“tril”: store lower triangular
“full”: full matrix
- diagonal : bool, optional
Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).
- row_header_col_name : str, optional
The column name for the row header column (default “ROI”).
- Raises:
ValueError
If an invalid matrix_kind is provided, if diagonal is False for matrix_kind “full”, if non-square data is provided for matrix_kind “triu” or “tril”, if the length of row_names does not match the data row count, or if the length of col_names does not match the data column count.
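The parameter checks listed under Raises can be sketched as a standalone function. This is a minimal sketch assuming the matrix is given as nested lists. The name check_matrix_args and the error messages are hypothetical, and only the five conditions themselves come from the documentation above.

```python
def check_matrix_args(data, col_names=None, row_names=None,
                      matrix_kind="full", diagonal=True):
    """Raise ValueError for each documented invalid combination."""
    n_rows = len(data)
    n_cols = len(data[0]) if n_rows else 0

    if matrix_kind not in ("triu", "tril", "full"):
        raise ValueError(f"Invalid matrix_kind: {matrix_kind!r}")
    if matrix_kind == "full" and not diagonal:
        raise ValueError("diagonal=False is not allowed for matrix_kind='full'")
    if matrix_kind in ("triu", "tril") and n_rows != n_cols:
        raise ValueError("Triangular storage requires square data")
    if row_names is not None and len(row_names) != n_rows:
        raise ValueError("Length of row_names does not match the row count")
    if col_names is not None and len(col_names) != n_cols:
        raise ValueError("Length of col_names does not match the column count")


square = [[1, 2], [3, 4]]
wide = [[1, 2, 3], [4, 5, 6]]

# Valid calls return None without raising.
check_matrix_args(square, col_names=["a", "b"], row_names=["x", "y"])
check_matrix_args(wide, matrix_kind="full")
check_matrix_args(square, matrix_kind="triu", diagonal=False)
```

Running all checks before any write means a rejected call leaves the storage untouched.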
- store_metadata(meta_md5, element, meta)
Store metadata.
This method first loads existing metadata (if any) using _read_metadata, appends the new metadata to it, and then saves the updated metadata using _write_processed_data. It will only store metadata if meta_md5 is not found already.
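The load, append, and save cycle can be sketched as follows. This is an illustration under stated assumptions: a JSON file stands in for the HDF5 file, the on-disk layout (a mapping from meta_md5 to the element and metadata) is hypothetical, and the function below is not junifer's method despite sharing its name.

```python
import json
import tempfile
from pathlib import Path


def store_metadata(path, meta_md5, element, meta):
    """Store metadata only if meta_md5 is not present yet.

    Returns True if something was written, False otherwise.
    """
    path = Path(path)
    # Load existing metadata, if any.
    existing = json.loads(path.read_text()) if path.exists() else {}
    if meta_md5 in existing:
        return False  # already stored: leave the file untouched
    # Append the new entry and save the updated metadata.
    existing[meta_md5] = {"element": element, "meta": meta}
    path.write_text(json.dumps(existing))
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata.json"
    first = store_metadata(target, "abc123", {"subject": "sub-01"}, {"name": "feat"})
    second = store_metadata(target, "abc123", {"subject": "sub-01"}, {"name": "other"})
    stored = json.loads(target.read_text())
```

The second call is a no-op because the hash is already present, so the first metadata entry is preserved.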
- store_timeseries(meta_md5, element, data, col_names=None)
Store timeseries.
- class junifer.storage.PandasBaseFeatureStorage(uri, single_output=True, **kwargs)
Abstract base class for feature storage via pandas.
For every interface that is required, one needs to provide a concrete implementation of this abstract class.
- Parameters:
- uri : str or pathlib.Path
The path to the storage.
- single_output : bool, optional
Whether to have single output (default True).
- **kwargs
Keyword arguments passed to superclass.
See also
BaseFeatureStorage
The base class for feature storage.
- static element_to_index(element, n_rows=1, rows_col_name=None)
Convert the element metadata to index.
- Parameters:
- Returns:
pandas.Index or pandas.MultiIndex
The index of the dataframe to store.
- get_valid_inputs()
Get valid storage types for input.
- store_df(meta_md5, element, df)
Implement pandas DataFrame storing.
- Parameters:
- df : pandas.DataFrame or pandas.Series
The pandas DataFrame or Series to store.
- meta : dict
The metadata as a dictionary.
- Raises:
ValueError
If the dataframe index has items that are not in the index generated from the metadata.
- store_timeseries(meta_md5, element, data, col_names=None)
Store timeseries.
- class junifer.storage.SQLiteFeatureStorage(uri, single_output=True, upsert='update', **kwargs)
Concrete implementation for feature storage via SQLite.
- Parameters:
- uri : str or pathlib.Path
The path to the file to be used.
- single_output : bool, optional
If False, will create one SQLite file per element. The name of the file will be prefixed with the respective element. If True, will create only one SQLite file as specified in the uri and store all the elements in the same file. This behaviour is only suitable for non-parallel executions. SQLite does not support concurrency (default True).
- upsert : {“ignore”, “update”}, optional
Upsert mode. If “ignore” is used, the existing elements are ignored. If “update”, the existing elements are updated (default “update”).
- **kwargs : dict
The keyword arguments passed to the superclass.
See also
PandasBaseFeatureStorage
The base class for Pandas-based feature storage.
HDF5FeatureStorage
The concrete class for HDF5-based feature storage.
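The difference between the two upsert modes can be illustrated with Python's built-in sqlite3 module. This is a sketch of the general idea only. The table layout, the helper upsert_rows, and the use of INSERT OR IGNORE and INSERT OR REPLACE are assumptions for illustration and do not describe how SQLiteFeatureStorage issues its queries.

```python
import sqlite3


def upsert_rows(conn, rows, mode="update"):
    """Insert rows keyed by element, honouring the upsert mode."""
    if mode == "ignore":
        # Keep the existing row when the element is already present.
        sql = "INSERT OR IGNORE INTO features (element, value) VALUES (?, ?)"
    elif mode == "update":
        # Replace the existing row when the element is already present.
        sql = "INSERT OR REPLACE INTO features (element, value) VALUES (?, ?)"
    else:
        raise ValueError(f"Invalid upsert mode: {mode!r}")
    conn.executemany(sql, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (element TEXT PRIMARY KEY, value REAL)")

upsert_rows(conn, [("sub-01", 1.0)])
upsert_rows(conn, [("sub-01", 2.0)], mode="ignore")
after_ignore = conn.execute("SELECT element, value FROM features").fetchall()

upsert_rows(conn, [("sub-01", 3.0)], mode="update")
after_update = conn.execute("SELECT element, value FROM features").fetchall()
conn.close()
```

After the "ignore" call the original value is still in place, while the "update" call overwrites it.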
- collect()
Implement data collection.
- Raises:
NotImplementedError
If single_output is True.
- get_engine(element=None)
Get engine.
- Parameters:
- element : dict, optional
The element as a dictionary (default None).
- Returns:
sqlalchemy.engine.Engine
The sqlalchemy engine.
- list_features()
List the features in the storage.
- read_df(feature_name=None, feature_md5=None)
Implement feature reading into a pandas DataFrame.
Either feature_name or feature_md5 needs to be specified.
- Parameters:
- Returns:
pandas.DataFrame
The features as a dataframe.
- Raises:
ValueError
If parameter values are invalid or feature is not found or multiple features are found.
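The lookup rules behind these errors can be sketched with a plain dictionary of features. This is a minimal sketch: resolve_feature and the mapping from MD5 hash to a metadata dictionary with a "name" key are hypothetical, and treating "both arguments given" as invalid is an assumption, since the text only states that one of the two must be specified.

```python
def resolve_feature(features, feature_name=None, feature_md5=None):
    """Return the MD5 key of the single feature matching the query."""
    # Assumption: exactly one of the two arguments must be given.
    if (feature_name is None) == (feature_md5 is None):
        raise ValueError("Specify exactly one of feature_name or feature_md5")

    if feature_md5 is not None:
        if feature_md5 not in features:
            raise ValueError(f"Feature MD5 {feature_md5!r} not found")
        return feature_md5

    # Look the feature up by name; names are not guaranteed to be unique.
    matches = [md5 for md5, meta in features.items()
               if meta["name"] == feature_name]
    if not matches:
        raise ValueError(f"Feature {feature_name!r} not found")
    if len(matches) > 1:
        raise ValueError(f"Multiple features named {feature_name!r} found")
    return matches[0]


features = {
    "md5-a": {"name": "mean_signal"},
    "md5-b": {"name": "connectivity"},
    "md5-c": {"name": "connectivity"},
}
by_name = resolve_feature(features, feature_name="mean_signal")
by_md5 = resolve_feature(features, feature_md5="md5-b")
```

When a name is ambiguous, as with "connectivity" here, querying by MD5 hash is the unambiguous alternative.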
- store_df(meta_md5, element, df)
Implement pandas DataFrame storing.
- Parameters:
- df : pandas.DataFrame or pandas.Series
The pandas DataFrame or Series to store.
- meta : dict
The metadata as a dictionary.
- Raises:
ValueError
If the dataframe index has items that are not in the index generated from the metadata.
- store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)
Store matrix.
- Parameters:
- meta_md5 : str
The metadata MD5 hash.
- element : dict
The element as a dictionary.
- data : numpy.ndarray
The matrix data to store.
- meta : dict
The metadata as a dictionary.
- col_names : list or tuple of str, optional
The column labels (default None).
- row_names : str, optional
The row labels (default None).
- matrix_kind : str, optional
The kind of matrix (default “full”):
“triu”: store upper triangular only
“tril”: store lower triangular
“full”: full matrix
- diagonal : bool, optional
Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).