|
| BOOST_STATIC_CONSTANT (std::size_t, DefaultBatchSize=256) |
| Defines the default batch size of the Container. More...
|
|
template<class T > |
bool | operator== (const Data< T > &rhs) |
| Two containers compare equal if they share the same data. More...
|
|
template<class T > |
bool | operator!= (const Data< T > &rhs) |
| Two containers compare unequal if they don't share the same data. More...
|
|
const_element_range | elements () const |
| Returns the range of elements. More...
|
|
element_range | elements () |
| Returns the range of elements. More...
|
|
const_batch_range | batches () const |
| Returns the range of batches. More...
|
|
batch_range | batches () |
| Returns the range of batches. More...
|
|
std::size_t | numberOfBatches () const |
| Returns the number of batches of the set. More...
|
|
std::size_t | numberOfElements () const |
| Returns the total number of elements. More...
|
|
Shape const & | shape () const |
| Returns the shape of the elements in the dataset. More...
|
|
Shape & | shape () |
| Returns the shape of the elements in the dataset. More...
|
|
bool | empty () const |
| Check whether the set is empty. More...
|
|
element_reference | element (std::size_t i) |
|
const_element_reference | element (std::size_t i) const |
|
batch_reference | batch (std::size_t i) |
|
const_batch_reference | batch (std::size_t i) const |
|
| Data () |
| Constructor which constructs an empty set. More...
|
|
| Data (std::size_t numBatches) |
| Construct a dataset with empty batches. More...
|
|
| Data (std::size_t size, element_type const &element, std::size_t batchSize=DefaultBatchSize) |
| Construction with size and a single element. More...
|
|
void | read (InArchive &archive) |
| Read the component from the supplied archive. More...
|
|
void | write (OutArchive &archive) const |
| Write the component to the supplied archive. More...
|
|
virtual void | makeIndependent () |
| This method makes the vector independent of all siblings and parents. More...
|
|
void | splitBatch (std::size_t batch, std::size_t elementIndex) |
|
Data | splice (std::size_t batch) |
| Splits the container into two independent parts. The front part remains in the container, the back part is returned. More...
|
|
void | append (Data const &other) |
| Appends the contents of another data object to the end. More...
|
|
void | push_back (const_batch_reference batch) |
|
template<class Range > |
void | repartition (Range const &batchSizes) |
| Reorders the batch structure in the container to that indicated by the batchSizes vector. More...
|
|
std::vector< std::size_t > | getPartitioning () const |
| Creates a vector with the batch sizes of every batch. More...
|
|
void | indexedSubset (IndexSet const &indices, Data &subset, Data &complement) const |
| Fill in the subset defined by the list of indices as well as its complement. More...
|
|
Data | indexedSubset (IndexSet const &indices) const |
|
virtual | ~ISerializable () |
| Virtual d'tor. More...
|
|
void | load (InArchive &archive, unsigned int version) |
| Versioned loading of components, calls read(...). More...
|
|
void | save (OutArchive &archive, unsigned int version) const |
| Versioned storing of components, calls write(...). More...
|
|
| BOOST_SERIALIZATION_SPLIT_MEMBER () |
|
template<class Type>
class shark::Data< Type >
Data container.
The Data class is Shark's container for machine learning data. This container (and its sub-classes) is used for input data, labels, and model outputs.
- The Data container organizes the data it holds in batches. That is, it tries to find a good representation for a whole group of data points, for example 100, at once. If the stored element type is, for example, RealVector, the batches are of type RealMatrix. This is useful because operations on a whole matrix are usually faster than operations on the separate vectors. Nearly all operations of the set have to be interpreted in terms of batches. The iterator interface therefore gives access to the batches, not to single elements. For element access, separate element_iterators and const_element_iterators can be used.
- When you need to explicitly iterate over all elements, you can use:
Data<RealVector> data;
for(auto elem: data.elements()){
std::cout<<elem<<" ";
// ref refers to the same element, so this doubles the stored values
auto ref=elem;
ref*=2;
std::cout<<elem<<std::endl;
}
- Element-wise access is usually slower than accessing the batches. If possible, use direct batch access, or at least use the iterator interface or the for loop above to iterate over all elements. Random access to a single element takes linear time, so use it sparingly. To work with batches you need to know the actual batch type, which depends on the element type. The rules are: if the element is an arithmetic type like int or double, the batch is a vector of that type (i.e. double->RealVector or int->IntVector). For vectors the batches are matrices, as mentioned above. If the vector is sparse, so is the matrix. For everything else the batch type is just a std::vector of the element type, so no optimization can be applied.
- When constructing the container, the batchSize can be set. If it is not set by the user, the default batchSize is chosen. A batchSize of 0 corresponds to putting all data into a single batch. Beware that not only the data needs storage but also the various models during computation, so the memory needed to process a batch can greatly exceed the size of the batch itself.
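The batch organization and the linear-time element access described above can be sketched without Shark. This is only an illustration under stated assumptions: `batchSizes` and `locateElement` are hypothetical helpers, and the splitting policy (full batches followed by one smaller remainder batch) is an assumption, not necessarily the one Shark uses. Only the rule that a batchSize of 0 yields a single batch is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Shark: split `size` elements into batches
// of at most `batchSize` elements. A batchSize of 0 means one single batch.
std::vector<std::size_t> batchSizes(std::size_t size, std::size_t batchSize) {
    if (size == 0) return {};
    if (batchSize == 0 || batchSize >= size) return {size};
    // Assumed policy: as many full batches as fit, then the remainder.
    std::vector<std::size_t> sizes(size / batchSize, batchSize);
    if (size % batchSize != 0) sizes.push_back(size % batchSize);
    return sizes;
}

// Hypothetical helper: find (batch index, offset inside that batch) of the
// i-th element. The walk over the batches is why random access to a single
// element is linear in the number of batches instead of constant time.
std::pair<std::size_t, std::size_t>
locateElement(std::vector<std::size_t> const& sizes, std::size_t i) {
    std::size_t batch = 0;
    while (batch < sizes.size() && i >= sizes[batch]) {
        i -= sizes[batch];
        ++batch;
    }
    assert(batch < sizes.size() && "element index out of range");
    return {batch, i};
}
```

With these helpers, 600 elements and a batch size of 256 give three batches, and element 300 is found in the second batch after skipping the first. Iterating batch by batch avoids repeating this walk for every element.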
An additional feature of the Data class is that it can be used to create lazy subsets: the batches of a dataset can be shared between various instances of the Data class without additional memory overhead.
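One way to picture batch sharing is a container that stores reference-counted pointers to its batches. The sketch below is not Shark's implementation: `SharedData`, `Batch`, and `subset` are stand-in names chosen for illustration, and the deep copy in `makeIndependent` only mirrors the role the member list gives to the method of the same name.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in types for illustration only, not Shark's implementation.
using Batch = std::vector<double>;

struct SharedData {
    std::vector<std::shared_ptr<Batch>> batches;

    // Lazy subset: copies only the pointers of batches [first, last),
    // so no element data is duplicated.
    SharedData subset(std::size_t first, std::size_t last) const {
        SharedData result;
        result.batches.assign(batches.begin() + first, batches.begin() + last);
        return result;
    }

    // Deep-copies every batch so that later writes through other
    // instances no longer affect this one.
    void makeIndependent() {
        for (auto& b : batches) b = std::make_shared<Batch>(*b);
    }
};
```

In this sketch a write through the parent is visible in the subset until the subset calls `makeIndependent()`. That is the sharing behaviour to keep in mind when working with subsets.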
- Warning
- Be aware, especially for derived containers like LabeledData, that the set does not enforce structural consistency. When you change the structure of the data part, for example by directly changing the size of the batches, the labels are not resized accordingly. Also, when creating subsets of a set, changing the parent will change its siblings and vice versa. The programmer needs to ensure structural integrity! For example, this is dangerous:
void function(Data<unsigned int>& data){
Data<unsigned int> newData(...);
data=newData;
}
When data was originally a LabeledData object and newData has a different batch structure than data, this will lead to structural inconsistencies. When function is rewritten such that newData has the same structure as data, this code is perfectly fine. The best way to get around this problem is to rewrite the code as:
Data<unsigned int> function(){
Data<unsigned int> newData(...);
return newData;
}
- Todo:
- expand docu
Definition at line 128 of file Dataset.h.