9.1.5. Storage#

Storage classes for storing extracted features.

class junifer.storage.BaseFeatureStorage(uri, storage_types, single_output=True)#

Abstract base class for feature storage.

For every interface that is required, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : str or pathlib.Path

The path to the storage.

storage_types : str or list of str

The available storage types for the class.

single_output : bool, optional

Whether to have single output (default True).

Raises:
ValueError

If any required storage type is missing from storage_types.
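As an illustration, a minimal sketch of a concrete subclass is shown below; the class name and the in-memory backing dictionary are hypothetical, and only the abstract methods listed in this section plus store_vector() are implemented.

import pandas as pd

from junifer.storage import BaseFeatureStorage


class InMemoryFeatureStorage(BaseFeatureStorage):
    """Hypothetical storage that keeps everything in a dict (illustration only)."""

    def __init__(self, uri, single_output=True):
        # Declare "vector" as the storage type this backend provides.
        super().__init__(
            uri=uri, storage_types=["vector"], single_output=single_output
        )
        self._features = {}  # meta_md5 -> {"meta": ..., "data": ...}

    def get_valid_inputs(self):
        return ["vector"]

    def list_features(self):
        return {md5: entry["meta"] for md5, entry in self._features.items()}

    def store_metadata(self, meta_md5, element, meta):
        self._features.setdefault(meta_md5, {})["meta"] = meta

    def store_vector(self, meta_md5, element, data, col_names=None):
        self._features.setdefault(meta_md5, {})["data"] = (element, data, col_names)

    def read(self, feature_name=None, feature_md5=None):
        return self._features[feature_md5]

    def read_df(self, feature_name=None, feature_md5=None):
        element, data, col_names = self._features[feature_md5]["data"]
        return pd.DataFrame([list(data)], columns=col_names)

    def collect(self):
        pass  # nothing to merge for an in-memory backend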

abstract collect()#

Collect data.

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage interface.

abstract list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.

abstract read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

abstract read_df(feature_name=None, feature_md5=None)#

Read feature into a pandas DataFrame.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

store(kind, **kwargs)#

Store extracted features data.

Parameters:
kind : {“matrix”, “timeseries”, “vector”}

The storage kind.

**kwargs

The keyword arguments passed on to the kind-specific store method.

Raises:
ValueError

If kind is invalid.
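For example, store() dispatches to the kind-specific method, as in the sketch below using the concrete HDF5FeatureStorage documented further down; the path, element, and MD5 value are placeholders.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="features.hdf5")        # placeholder path
meta_md5 = hashlib.md5(b"example-feature").hexdigest()   # placeholder MD5
element = {"subject": "sub-01"}                          # placeholder element

# Dispatches to store_vector(); passing kind="matrix" or kind="timeseries"
# would dispatch to store_matrix() or store_timeseries() instead.
storage.store(
    "vector",
    meta_md5=meta_md5,
    element=element,
    data=np.array([1.0, 2.0, 3.0]),
    col_names=["a", "b", "c"],
)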

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)#

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

abstract store_metadata(meta_md5, element, meta)#

Store metadata.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

validate(input_)#

Validate the input to the pipeline step.

Parameters:
input_ : list of str

The input to the pipeline step.

Raises:
ValueError

If the input_ is invalid.

class junifer.storage.HDF5FeatureStorage(uri, single_output=True, overwrite='update', compression=7, force_float32=True, chunk_size=100)#

Concrete implementation for feature storage via HDF5.

Parameters:
uri : str or pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one HDF5 file per element. The name of the file will be prefixed with the respective element. If True, will create only one HDF5 file as specified in the uri and store all the elements in the same file. Concurrent writes should be handled with care (default True).

overwrite : bool or “update”, optional

Whether to overwrite the existing file. If True, will overwrite; if “update”, will update the existing entry or append (default “update”).

compression : {0-9}, optional

Level of gzip compression: 0 (lowest) to 9 (highest) (default 7).

force_float32 : bool, optional

Whether to force casting of numpy.ndarray values to float32 if float64 values are found (default True).

chunk_size : int, optional

The chunk size to use when collecting data from element files in collect(). If the file count is smaller than this value, the file count is used instead (default 100).

See also

SQLiteFeatureStorage

The concrete class for SQLite-based feature storage.
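A hedged usage sketch follows: the path is a placeholder and the metadata dictionary and its MD5 hash are simplified placeholders; in a real junifer pipeline the markers generate both.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="example.hdf5")  # placeholder path

# Placeholder metadata; junifer markers generate this in practice.
meta = {"name": "BOLD_example_marker", "type": "BOLD"}
meta_md5 = hashlib.md5(str(meta).encode()).hexdigest()
element = {"subject": "sub-01", "session": "ses-01"}

storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
storage.store_vector(
    meta_md5=meta_md5,
    element=element,
    data=np.array([0.1, 0.2, 0.3]),
    col_names=["roi1", "roi2", "roi3"],
)

print(storage.list_features())              # keyed by meta_md5
df = storage.read_df(feature_md5=meta_md5)  # feature as a DataFrame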

collect()#

Implement data collection.

This method globs the element files and loops over them, reading each file’s metadata, and then loops over all the features stored in that metadata, storing the metadata and the feature data right after reading.

Raises:
NotImplementedError

If single_output is True.
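A hedged sketch of the intended usage with single_output=False: each (possibly parallel) job stores its element into an element-prefixed file, and a final call to collect() merges those files into the file given by uri. The path, element values, metadata, and MD5 hash are placeholders.

import numpy as np

from junifer.storage import HDF5FeatureStorage

uri = "group.hdf5"                               # placeholder path
meta = {"name": "example_marker"}                # placeholder metadata
meta_md5 = "d41d8cd98f00b204e9800998ecf8427e"    # placeholder MD5

# In each job (typically one per element), write to an element-prefixed file.
for subject in ("sub-01", "sub-02"):
    element = {"subject": subject}
    job_storage = HDF5FeatureStorage(uri=uri, single_output=False)
    job_storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
    job_storage.store_vector(
        meta_md5=meta_md5,
        element=element,
        data=np.array([1.0, 2.0]),
        col_names=["a", "b"],
    )

# Afterwards, merge all element files into the single file at `uri`.
HDF5FeatureStorage(uri=uri, single_output=False).collect()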

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage.

list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.
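Since the returned dictionary is keyed by the feature MD5, it can be used to look features up before reading, as in the sketch below; it assumes features have already been stored in the file and the path is a placeholder.

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="example.hdf5")  # placeholder path

features = storage.list_features()
for md5, meta in features.items():
    print(md5, meta.get("name"))

# Read one feature by its MD5 key.
some_md5 = next(iter(features))
df = storage.read_df(feature_md5=some_md5)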

read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

Raises:
IOError

If HDF5 file does not exist.

read_df(feature_name=None, feature_md5=None)#

Read feature into a pandas.DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
IOError

If HDF5 file does not exist.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True, row_header_col_name='ROI')#

Store matrix.

This method performs parameter checks and then calls _store_data for storing the data.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

row_header_col_name : str, optional

The column name for the row header column (default “ROI”).

Raises:
ValueError

If an invalid matrix_kind is provided, if diagonal is False for matrix_kind = “full”, if non-square data is provided for matrix_kind = “triu” or “tril”, or if the length of row_names or col_names does not match the data’s row or column count.
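A hedged sketch of storing a symmetric connectivity matrix while keeping only the upper triangle; the path, labels, element, and MD5 value are placeholders.

import hashlib

import numpy as np

from junifer.storage import HDF5FeatureStorage

storage = HDF5FeatureStorage(uri="matrices.hdf5")   # placeholder path

labels = ["ROI-1", "ROI-2", "ROI-3"]
rng = np.random.default_rng(0)
data = rng.random((3, 3))
data = (data + data.T) / 2                          # make it symmetric

storage.store_matrix(
    meta_md5=hashlib.md5(b"connectivity-example").hexdigest(),  # placeholder
    element={"subject": "sub-01"},
    data=data,
    col_names=labels,
    row_names=labels,
    matrix_kind="triu",   # keep the upper triangle only
    diagonal=False,       # allowed because matrix_kind is not "full"
)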

store_metadata(meta_md5, element, meta)#

Store metadata.

This method first loads existing metadata (if any) using _read_metadata and appends to it the new metadata and then saves the updated metadata using _write_processed_data. It will only store metadata if meta_md5 is not found already.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

class junifer.storage.PandasBaseFeatureStorage(uri, single_output=True, **kwargs)#

Abstract base class for feature storage via pandas.

For every interface that is required, one needs to provide a concrete implementation of this abstract class.

Parameters:
uri : str or pathlib.Path

The path to the storage.

single_output : bool, optional

Whether to have single output (default True).

**kwargs

Keyword arguments passed to the superclass.

See also

BaseFeatureStorage

The base class for feature storage.

static element_to_index(element, n_rows=1, rows_col_name=None)#

Convert the element metadata to index.

Parameters:
element : dict

The element as a dictionary.

n_rows : int, optional

Number of rows to create (default 1).

rows_col_name : str, optional

The column name to use in case n_rows > 1. If None and n_rows > 1, the name will be “idx” (default None).

Returns:
pandas.Index or pandas.MultiIndex

The index of the dataframe to store.
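A short sketch of the index produced for single-row and multi-row features; the element values and the extra level name are placeholders.

from junifer.storage import PandasBaseFeatureStorage

element = {"subject": "sub-01", "session": "ses-01"}

# One row: the index only encodes the element keys.
idx_single = PandasBaseFeatureStorage.element_to_index(element)

# Several rows: an extra level is added; with rows_col_name=None it would be
# called "idx" as described above.
idx_multi = PandasBaseFeatureStorage.element_to_index(
    element, n_rows=3, rows_col_name="timepoint"
)
print(idx_multi.names)  # e.g. ['subject', 'session', 'timepoint']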

get_valid_inputs()#

Get valid storage types for input.

Returns:
list of str

The list of storage types that can be used as input for this storage interface.

store_df(meta_md5, element, df)#

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.
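To avoid that ValueError, the dataframe index can be built from the element with element_to_index(), as in the hedged sketch below using the concrete SQLiteFeatureStorage; the path, level name, and MD5 value are placeholders, and a real pipeline would also store the feature’s metadata via store_metadata().

import numpy as np
import pandas as pd

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")  # placeholder path
element = {"subject": "sub-01"}

# Build the index from the element so that it matches the index the storage
# generates from the metadata.
index = SQLiteFeatureStorage.element_to_index(
    element, n_rows=2, rows_col_name="scan"
)
df = pd.DataFrame(
    np.random.rand(2, 3), index=index, columns=["roi1", "roi2", "roi3"]
)

storage.store_df(
    meta_md5="0" * 32,   # placeholder MD5
    element=element,
    df=df,
)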

store_timeseries(meta_md5, element, data, col_names=None)#

Store timeseries.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The timeseries data to store.

col_names : list or tuple of str, optional

The column labels (default None).

store_vector(meta_md5, element, data, col_names=None)#

Store vector.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray or list

The vector data to store.

col_names : list or tuple of str, optional

The column labels (default None).

class junifer.storage.SQLiteFeatureStorage(uri, single_output=True, upsert='update', **kwargs)#

Concrete implementation for feature storage via SQLite.

Parameters:
uri : str or pathlib.Path

The path to the file to be used.

single_output : bool, optional

If False, will create one SQLite file per element. The name of the file will be prefixed with the respective element. If True, will create only one SQLite file as specified in the uri and store all the elements in the same file. This behaviour is only suitable for non-parallel executions, as SQLite does not support concurrent writes (default True).

upsert : {“ignore”, “update”}, optional

Upsert mode. If “ignore” is used, the existing elements are ignored. If “update”, the existing elements are updated (default “update”).

**kwargs : dict

The keyword arguments passed to the superclass.

See also

PandasBaseFeatureStorage

The base class for Pandas-based feature storage.

HDF5FeatureStorage

The concrete class for HDF5-based feature storage.
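A hedged usage sketch analogous to the HDF5 example above; the path, metadata, and element are placeholders that junifer markers would generate in a real pipeline.

import hashlib

import numpy as np

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")   # placeholder path

meta = {"name": "example_timeseries_marker"}           # placeholder metadata
meta_md5 = hashlib.md5(str(meta).encode()).hexdigest()
element = {"subject": "sub-01"}

storage.store_metadata(meta_md5=meta_md5, element=element, meta=meta)
storage.store_timeseries(
    meta_md5=meta_md5,
    element=element,
    data=np.random.rand(10, 2),      # 10 timepoints x 2 columns
    col_names=["roi1", "roi2"],
)

df = storage.read_df(feature_md5=meta_md5)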

collect()#

Implement data collection.

Raises:
NotImplementedError

If single_output is True.

get_engine(element=None)#

Get engine.

Parameters:
element : dict, optional

The element as a dictionary (default None).

Returns:
sqlalchemy.engine.Engine

The sqlalchemy engine.
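The returned engine can be used to inspect the underlying SQLite file directly, for example to list its tables; the path below is a placeholder and no assumption is made about the table layout.

from sqlalchemy import inspect

from junifer.storage import SQLiteFeatureStorage

storage = SQLiteFeatureStorage(uri="example.sqlite")  # placeholder path
engine = storage.get_engine()

# List the tables junifer created in the SQLite file.
print(inspect(engine).get_table_names())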

list_features()#

List the features in the storage.

Returns:
dict

Dictionary of features in the storage. The keys are the feature MD5 (to be used in read_df()) and the values are the metadata of each feature.

read(feature_name=None, feature_md5=None)#

Read stored feature.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
dict

The stored feature as a dictionary.

read_df(feature_name=None, feature_md5=None)#

Implement feature reading into a pandas DataFrame.

Either one of feature_name or feature_md5 needs to be specified.

Parameters:
feature_name : str, optional

Name of the feature to read (default None).

feature_md5 : str, optional

MD5 hash of the feature to read (default None).

Returns:
pandas.DataFrame

The features as a dataframe.

Raises:
ValueError

If the parameter values are invalid, if the feature is not found, or if multiple features are found.

store_df(meta_md5, element, df)#

Implement pandas DataFrame storing.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

df : pandas.DataFrame or pandas.Series

The pandas DataFrame or Series to store.

Raises:
ValueError

If the dataframe index has items that are not in the index generated from the metadata.

store_matrix(meta_md5, element, data, col_names=None, row_names=None, matrix_kind='full', diagonal=True)#

Store matrix.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

data : numpy.ndarray

The matrix data to store.

col_names : list or tuple of str, optional

The column labels (default None).

row_names : list or tuple of str, optional

The row labels (default None).

matrix_kind : str, optional

The kind of matrix:

  • triu : store upper triangular only

  • tril : store lower triangular only

  • full : store the full matrix

(default “full”).

diagonal : bool, optional

Whether to store the diagonal. If matrix_kind is “full”, setting this to False will raise an error (default True).

store_metadata(meta_md5, element, meta)#

Implement metadata storing in the storage.

Parameters:
meta_md5 : str

The metadata MD5 hash.

element : dict

The element as a dictionary.

meta : dict

The metadata as a dictionary.