shark::Data< Type > Class Template Reference

#include <shark/Data/Dataset.h>

Inheritance diagram for shark::Data< Type >:

Public Types
typedef batch_type &	batch_reference

typedef batch_type const &	const_batch_reference

typedef Batch< element_type >::reference	element_reference

typedef Batch< element_type >::const_reference	const_element_reference

typedef std::vector< std::size_t >	IndexSet

typedef boost::iterator_range< detail::DataElementIterator< Data< Type > > >	element_range

typedef boost::iterator_range< detail::DataElementIterator< Data< Type > const > >	const_element_range

typedef detail::BatchRange< Data< Type > >	batch_range

typedef detail::BatchRange< Data< Type > const >	const_batch_range

Public Member Functions
	BOOST_STATIC_CONSTANT (std::size_t, DefaultBatchSize=256)
	Defines the default batch size of the Container.

template<class T >
bool	operator== (const Data< T > &rhs)
	Two containers compare equal if they share the same data.

template<class T >
bool	operator!= (const Data< T > &rhs)
	Two containers compare unequal if they don't share the same data.

const_element_range	elements () const
	Returns the range of elements.

element_range	elements ()
	Returns the range of elements.

const_batch_range	batches () const
	Returns the range of batches.

batch_range	batches ()
	Returns the range of batches.

std::size_t	numberOfBatches () const
	Returns the number of batches of the set.

std::size_t	numberOfElements () const
	Returns the total number of elements.

Shape const &	shape () const
	Returns the shape of the elements in the dataset.

Shape &	shape ()
	Returns the shape of the elements in the dataset.

bool	empty () const
	Check whether the set is empty.

element_reference	element (std::size_t i)

const_element_reference	element (std::size_t i) const

batch_reference	batch (std::size_t i)

const_batch_reference	batch (std::size_t i) const

	Data ()
	Constructor which constructs an empty set.

	Data (std::size_t numBatches)
	Construct a dataset with empty batches.

	Data (std::size_t size, element_type const &element, std::size_t batchSize=DefaultBatchSize)
	Construction with size and a single element.

void	read (InArchive &archive)
	Read the component from the supplied archive.

void	write (OutArchive &archive) const
	Write the component to the supplied archive.

virtual void	makeIndependent ()
	This method makes the vector independent of all siblings and parents.

void	splitBatch (std::size_t batch, std::size_t elementIndex)

Data	splice (std::size_t batch)
	Splits the container into two independent parts. The front part remains in the container, the back part is returned.

void	append (Data const &other)
	Appends the contents of another data object to the end.

void	push_back (const_batch_reference batch)

template<class Range >
void	repartition (Range const &batchSizes)
	Reorders the batch structure in the container to that indicated by the batchSizes vector.

std::vector< std::size_t >	getPartitioning () const
	Creates a vector with the batch sizes of every batch.

void	indexedSubset (IndexSet const &indices, Data &subset, Data &complement) const
	Fill in the subset defined by the list of indices as well as its complement.

Data	indexedSubset (IndexSet const &indices) const

Public Member Functions inherited from shark::ISerializable
virtual	~ISerializable ()
	Virtual destructor.

void	load (InArchive &archive, unsigned int version)
	Versioned loading of components, calls read(...).

void	save (OutArchive &archive, unsigned int version) const
	Versioned storing of components, calls write(...).

	BOOST_SERIALIZATION_SPLIT_MEMBER ()

Protected Types
typedef detail::SharedContainer< Type >	Container

Protected Attributes
Container	m_data
	data

Shape	m_shape
	shape of a datapoint

Friends
template<class InputT , class LabelT >
class	LabeledData

void	swap (Data &a, Data &b)

Detailed Description

template<class Type>
class shark::Data< Type >

Data container.

The Data class is Shark's container for machine learning data. This container (and its sub-classes) is used for input data, labels, and model outputs.

The Data container organizes the data it holds in batches. This means that it tries to find a good representation for a whole set of data points, for example 100, at once. If the stored type is, for example, RealVector, the batches are of type RealMatrix. This is useful because operations on a whole matrix are usually faster than operations on the separate vectors. Nearly all operations of the set have to be interpreted in terms of batches. Therefore the iterator interface gives access to the batches, not to single elements. For single elements, the separate element_iterators and const_element_iterators can be used.

When you need to explicitly iterate over all elements, you can use:
Data<RealVector> data;
for(auto elem: data.elements()){
// print the element, scale it by 2 via ref, then print it again
std::cout<<elem<<" ";
auto ref=elem;
ref*=2;
std::cout<<elem<<std::endl;
}

Element-wise access is usually slower than accessing the batches. If possible, use direct batch access, or at least use the iterator interface or the for loop above to iterate over all elements. Random access to a single element takes linear time, so use it wisely. Of course, when you want to use batches, you need to know the actual batch type, which depends on the type of the input. Here are the rules: if the input is an arithmetic type like int or double, the batch is a vector of this type (i.e. double->RealVector or int->IntVector). For vectors the batches are matrices, as mentioned above. If the vector is sparse, so is the matrix. For everything else the batch type is just a std::vector of the type, so no optimization can be applied.
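The batch type rules can be illustrated with a small type trait. This is only a standalone sketch, not Shark's implementation: DenseVector, DenseMatrix and BatchOf are hypothetical stand-ins introduced here for illustration, and the sparse case is left out.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the library's vector and matrix types.
template <class T> struct DenseVector { std::vector<T> values; };
template <class T> struct DenseMatrix {
    std::vector<T> values;
    std::size_t rows = 0;
    std::size_t cols = 0;
};

// Fallback rule: any other element type is batched as a plain std::vector.
template <class T, class Enable = void>
struct BatchOf { using type = std::vector<T>; };

// Arithmetic elements (int, double, ...) are batched as a vector of that type.
template <class T>
struct BatchOf<T, typename std::enable_if<std::is_arithmetic<T>::value>::type> {
    using type = DenseVector<T>;
};

// Vector elements are batched as a matrix (one element per row is assumed here).
template <class T>
struct BatchOf<DenseVector<T>> { using type = DenseMatrix<T>; };
```

With these definitions, BatchOf<double>::type is DenseVector<double>, BatchOf<DenseVector<double>>::type is DenseMatrix<double>, and a type with no special rule falls back to std::vector.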

When constructing the container, the batchSize can be set. If the user does not set it, the default batch size (DefaultBatchSize, 256) is chosen. A batchSize of 0 corresponds to putting all data into a single batch. Beware that storage is needed not only for the data but also by the various models during computation, so the memory needed to process a batch can greatly exceed the size of the batch itself.
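To make the effect of batchSize concrete, the following sketch computes how many batches a given number of elements would need and how large each batch would be. The function names and the even-split policy are assumptions made for illustration; the partitioning Shark actually chooses may differ.

```cpp
#include <cstddef>

// Number of batches for `elements` items (hypothetical helper).
// batchSize == 0 means "put everything into one batch".
// Assumption: an empty data set has no batches.
constexpr std::size_t batchCount(std::size_t elements, std::size_t batchSize) {
    return elements == 0 ? 0
         : batchSize == 0 ? 1
         : (elements + batchSize - 1) / batchSize;  // round up
}

// Size of batch `i` when the elements are spread as evenly as possible
// over `batches` batches (requires batches > 0 and i < batches).
constexpr std::size_t batchSizeAt(std::size_t elements, std::size_t batches, std::size_t i) {
    return elements / batches + (i < elements % batches ? 1 : 0);
}
```

Under these assumptions, 1000 elements with the default batch size of 256 give 4 batches of 250 elements each, and a batchSize of 0 gives a single batch holding all 1000 elements.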

An additional feature of the Data class is that it can be used to create lazy subsets. So the batches of a dataset can be shared between various instances of the data class without additional memory overhead.
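The idea of sharing batches between instances can be sketched with reference-counted pointers. SharedBatches below is a hypothetical stand-in written for illustration and is not Shark's SharedContainer; it only shows why a subset costs no extra batch storage and why a write through one instance is visible in the others until the container is made independent.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical batch-sharing container (illustration only).
struct SharedBatches {
    using Batch = std::vector<double>;
    std::vector<std::shared_ptr<Batch>> batches;

    // Lazy subset: copies only the pointers of the selected batches,
    // so the batch contents themselves are shared with this container.
    SharedBatches subset(std::vector<std::size_t> const& batchIndices) const {
        SharedBatches result;
        for (std::size_t i : batchIndices) {
            result.batches.push_back(batches.at(i));  // throws if i is out of range
        }
        return result;
    }

    // Deep-copies every batch so later writes no longer affect
    // any container that previously shared them.
    void makeIndependent() {
        for (auto& b : batches) {
            b = std::make_shared<Batch>(*b);
        }
    }
};
```

In this sketch, a subset built from batch 1 of a parent points to the same storage as the parent's batch 1, so modifying one modifies the other. After calling makeIndependent() on the subset, its batches are private copies.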

Warning: Be aware, especially for derived containers like LabeledData, that the set does not enforce structural consistency. When you change the structure of the data part, for example by directly changing the size of the batches, the labels are not forced to change accordingly. Also, when creating subsets of a set, changing the parent will change its siblings and conversely. The programmer needs to ensure structural integrity! For example, this is dangerous:
void function(Data<unsigned int>& data){
Data<unsigned int> newData(...);
data=newData;
}
When data was originally a LabeledData object, and newData has a different batch structure than data, this will lead to structural inconsistencies. When function is rewritten such that newData has the same structure as data, this code is perfectly fine. The best way to get around this problem is to rewrite the code as:
Data<unsigned int> function(){
Data<unsigned int> newData(...);
return newData;
}

Todo: expand documentation

Definition at line 128 of file Dataset.h.

The documentation for this class was generated from the following file:

include/shark/Data/Dataset.h
