9.1.5. Storage#

Storage classes for storing extracted features.

class junifer.storage.BaseFeatureStorage(uri, storage_types, single_output=True)#

Abstract base class for feature storage.

For every interface that is required, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : str or pathlib.Path

The path to the storage.

storage_types : str or list of str

The available storage types for the class.

single_output : bool, optional

Whether to have single output (default True).

Raises:
ValueError

If any required storage type is missing from storage_types.
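As an illustration, a minimal sketch of a concrete subclass is shown below; the class name and the in-memory backing dictionary are hypothetical, and only the abstract methods listed in this section plus store_vector() are implemented.

import pandas as pd

from junifer.storage import BaseFeatureStorage


class InMemoryFeatureStorage(BaseFeatureStorage):
    """Hypothetical storage that keeps everything in a dict (illustration only)."""

    def __init__(self, uri, single_output=True):
        # Declare "vector" as the storage type this backend provides.
        super().__init__(
            uri=uri, storage_types=["vector"], single_output=single_output
        )
        self._features = {}  # meta_md5 -> {"meta": ..., "data": ...}

    def get_valid_inputs(self):
        return ["vector"]

    def list_features(self):
        return {md5: entry["meta"] for md5, entry in self._features.items()}

    def store_metadata(self, meta_md5, element, meta):
        self._features.setdefault(meta_md5, {})["meta"] = meta

    def store_vector(self, meta_md5, element, data, col_names=None):
        self._features.setdefault(meta_md5, {})["data"] = (element, data, col_names)

    def read(self, feature_name=None, feature_md5=None):
        return self._features[feature_md5]

    def read_df(self, feature_name=None, feature_md5=None):
        element, data, col_names = self._features[feature_md5]["data"]
        return pd.DataFrame([list(data)], columns=col_names)

    def collect(self):
        pass  # nothing to merge for an in-memory backend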

abstract collect()#

Collect data.

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage interface.

abstract list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.

abstract read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

abstract read_df(feature_name=None, feature_md5=None)#

Read feature into a pandas DataFrame.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

store(kind, **kwargs)#

Store extracted features data.

Parameters:
kind : {“matrix”, “timeseries”, “vector”}

The storage kind.

**kwargs

The keyword arguments passed on to the kind-specific store method.

Raises:
ValueError

If kind is invalid.
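For example, store() dispatches to the kind-specific method, as in the sketch below using the concrete HDF5FeatureStorage documented further down; the path, element, and MD5 value are placeholders.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="features.hdf5")        # placeholder path
meta_md5 = hashlib.md5(b"example-feature").hexdigest()   # placeholder MD5
element = {"subject": "sub-01"}                          # placeholder element

# Dispatches to store_vector(); passing kind="matrix" or kind="timeseries"
# would dispatch to store_matrix() or store_timeseries() instead.
storage.store(
    "vector",
    meta_md5=meta_md5,
    element=element,
    data=np.array([1.0, 2.0, 3.0]),
    col_names=["a", "b", "c"],
)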

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)#

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

abstract store_metadata(meta_md5, element, meta)#

Store metadata.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

validate(input_)#

Validate the input to the pipeline step.

Parameters:
input_ : list of str

The input to the pipeline step.

Raises:
ValueError

If the input_ is invalid.

class junifer.storage.HDF5FeatureStorage(uri, single_output=True, overwrite='update', compression=7, force_float32=True, chunk_size=100)#

Concrete implementation for feature storage via HDF5.

Parameters:
uri : str or pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one HDF5 file per element. The name of the file will be prefixed with the respective element. If True, will create only one HDF5 file as specified in the uri and store all the elements in the same file. Concurrent writes should be handled with care (default True).

overwrite : bool or “update”, optional

Whether to overwrite the existing file. If True, will overwrite; if “update”, will update the existing entry or append (default “update”).

compression : {0-9}, optional

Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).

force_float32 : bool, optional

Whether to force casting of numpy.ndarray values to float32 if float64 values are found (default True).

chunk_size : int, optional

The chunk size to use when collecting data from element files in collect(). If the file count is smaller than this value, the file count is used instead (default 100).

See also

SQLiteFeatureStorage

The concrete class for SQLite-based feature storage.
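A hedged usage sketch follows: the path is a placeholder and the metadata dictionary and its MD5 hash are simplified placeholders; in a real junifer pipeline the markers generate both.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="example.hdf5")  # placeholder path

# Placeholder metadata; junifer markers generate this in practice.
meta = {"name": "BOLD_example_marker", "type": "BOLD"}
meta_md5 = hashlib.md5(str(meta).encode()).hexdigest()
element = {"subject": "sub-01", "session": "ses-01"}

storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
storage.store_vector(
    meta_md5=meta_md5,
    element=element,
    data=np.array([0.1, 0.2, 0.3]),
    col_names=["roi1", "roi2", "roi3"],
)

print(storage.list_features())              # keyed by meta_md5
df = storage.read_df(feature_md5=meta_md5)  # feature as a DataFrame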

collect()#

Implement data collection.

This method globs the element files and loops over them, reading each file’s metadata, and then loops over all the features stored in that metadata, storing the metadata and the feature data right after reading.

Raises:
NotImplementedError

If single_output is True.
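A hedged sketch of the intended usage with single_output=False: each (possibly parallel) job stores its element into an element-prefixed file, and a final call to collect() merges those files into the file given by uri. The path, element values, metadata, and MD5 hash are placeholders.

import numpy as np

from junifer.storage import HDF5FeatureStorage

uri = "group.hdf5"                               # placeholder path
meta = {"name": "example_marker"}                # placeholder metadata
meta_md5 = "d41d8cd98f00b204e9800998ecf8427e"    # placeholder MD5

# In each job (typically one per element), write to an element-prefixed file.
for subject in ("sub-01", "sub-02"):
    element = {"subject": subject}
    job_storage = HDF5FeatureStorage(uri=uri, single_output=False)
    job_storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
    job_storage.store_vector(
        meta_md5=meta_md5,
        element=element,
        data=np.array([1.0, 2.0]),
        col_names=["a", "b"],
    )

# Afterwards, merge all element files into the single file at `uri`.
HDF5FeatureStorage(uri=uri, single_output=False).collect()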

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage.

list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.
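Since the returned dictionary is keyed by the feature MD5, it can be used to look features up before reading, as in the sketch below; it assumes features have already been stored in the file and the path is a placeholder.

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="example.hdf5")  # placeholder path

features = storage.list_features()
for md5, meta in features.items():
    print(md5, meta.get("name"))

# Read one feature by its MD5 key.
some_md5 = next(iter(features))
df = storage.read_df(feature_md5=some_md5)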

read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

Raises:
IOError

If HDF5 file does not exist.

read_df(feature_name=None, feature_md5=None)#

Read feature into a pandas.DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
IOError

If HDF5 file does not exist.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True, row_header_col_name='ROI')#

Store matrix.

This method performs parameter checks and then calls _store_data for storing the data.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

row_header_col_name : str, optional

The column name for the row header column (default “ROI”).

Raises:
ValueError

If an invalid matrix_kind is provided, if diagonal is False for matrix_kind = “full”, if non-square data is provided for matrix_kind = “triu” or “tril”, or if the length of row_names or col_names does not match the data’s row or column count.
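A hedged sketch of storing a symmetric connectivity matrix while keeping only the upper triangle; the path, labels, element, and MD5 value are placeholders.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="matrices.hdf5")   # placeholder path

labels = ["ROI-1", "ROI-2", "ROI-3"]
rng = np.random.default_rng(0)
data = rng.random((3, 3))
data = (data + data.T) / 2                          # make it symmetric

storage.store_matrix(
    meta_md5=hashlib.md5(b"connectivity-example").hexdigest(),  # placeholder
    element={"subject": "sub-01"},
    data=data,
    col_names=labels,
    row_names=labels,
    matrix_kind="triu",   # keep the upper triangle only
    diagonal=False,       # allowed because matrix_kind is not "full"
)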

store_metadata(meta_md5, element, meta)#

Store metadata.

This method first loads existing metadata (if any) using _read_metadata and appends to it the new metadata and then saves the updated metadata using _write_processed_data. It will only store metadata if meta_md5 is not found already.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

class junifer.storage.PandasBaseFeatureStorage(uri, single_output=True, **kwargs)#

Abstract base class for feature storage via pandas.

For every interface that is required, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : str or pathlib.Path

The path to the storage.

single_output : bool, optional

Whether to have single output (default True).

**kwargs

Keyword arguments passed to the superclass.

See also

BaseFeatureStorage

The base class for feature storage.

static element_to_index(element, n_rows=1, rows_col_name=None)#

Convert the element metadata to index.

Parameters:
element : dict

The element as a dictionary.

n_rows : int, optional

Number of rows to create (default 1).

rows_col_name : str, optional

The column name to use in case n_rows > 1. If None and n_rows > 1, the name will be “idx” (default None).

Returns:
pandas.Index or pandas.MultiIndex

The index of the dataframe to store.
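A short sketch of the index produced for single-row and multi-row features; the element values and the extra level name are placeholders.

from junifer.storage import PandasBaseFeatureStorage

element = {"subject": "sub-01", "session": "ses-01"}

# One row: the index only encodes the element keys.
idx_single = PandasBaseFeatureStorage.element_to_index(element)

# Several rows: an extra level is added; with rows_col_name=None it would be
# called "idx" as described above.
idx_multi = PandasBaseFeatureStorage.element_to_index(
    element, n_rows=3, rows_col_name="timepoint"
)
print(idx_multi.names)  # e.g. ['subject', 'session', 'timepoint']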

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage interface.

store_df(meta_md5, element, df)#

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.
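To avoid that ValueError, the dataframe index can be built from the element with element_to_index(), as in the hedged sketch below using the concrete SQLiteFeatureStorage; the path, level name, and MD5 value are placeholders, and a real pipeline would also store the feature’s metadata via store_metadata().

import numpy as np
import pandas as pd

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")  # placeholder path
element = {"subject": "sub-01"}

# Build the index from the element so that it matches the index the storage
# generates from the metadata.
index = SQLiteFeatureStorage.element_to_index(
    element, n_rows=2, rows_col_name="scan"
)
df = pd.DataFrame(
    np.random.rand(2, 3), index=index, columns=["roi1", "roi2", "roi3"]
)

storage.store_df(
    meta_md5="0" * 32,   # placeholder MD5
    element=element,
    df=df,
)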

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

class junifer.storage.SQLiteFeatureStorage(uri, single_output=True, upsert='update', **kwargs)#

Concrete implementation for feature storage via SQLite.

Parameters:
uri : str or pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one SQLite file per element. The name of the file will be prefixed with the respective element. If True, will create only one SQLite file as specified in the uri and store all the elements in the same file. This behaviour is only suitable for non-parallel executions, as SQLite does not support concurrent writes (default True).

upsert : {“ignore”, “update”}, optional

Upsert mode. If “ignore” is used, the existing elements are ignored. If “update”, the existing elements are updated (default “update”).

**kwargs : dict

The keyword arguments passed to the superclass.

See also

PandasBaseFeatureStorage

The base class for Pandas-based feature storage.

HDF5FeatureStorage

The concrete class for HDF5-based feature storage.
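A hedged usage sketch analogous to the HDF5 example above; the path, metadata, and element are placeholders that junifer markers would generate in a real pipeline.

import hashlib

import numpy as np

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")   # placeholder path

meta = {"name": "example_timeseries_marker"}           # placeholder metadata
meta_md5 = hashlib.md5(str(meta).encode()).hexdigest()
element = {"subject": "sub-01"}

storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
storage.store_timeseries(
    meta_md5=meta_md5,
    element=element,
    data=np.random.rand(10, 2),      # 10 timepoints x 2 columns
    col_names=["roi1", "roi2"],
)

df = storage.read_df(feature_md5=meta_md5)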

collect()#

Implement data collection.

Raises:
NotImplementedError

If single_output is True.

get_engine(element=None)#

Get engine.

Parameters:
element : dict, optional

The element as a dictionary (default None).

Returns:
sqlalchemy.engine.Engine

The sqlalchemy engine.
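The returned engine can be used to inspect the underlying SQLite file directly, for example to list its tables; the path below is a placeholder and no assumption is made about the table layout.

from sqlalchemy import inspect

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")  # placeholder path
engine = storage.get_engine()

# List the tables junifer created in the SQLite file.
print(inspect(engine).get_table_names())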

list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.

read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

read_df(feature_name=None, feature_md5=None)#

Implement feature reading into a pandas DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
ValueError

If the parameter values are invalid, if the feature is not found, or if multiple features are found.

store_df(meta_md5, element, df)#

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)#

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

store_metadata(meta_md5, element, meta)#

Implement metadata storing in the storage.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.