Currently, we do not support deployment use cases that require low latency (< 1 s) or high-volume lookups.
The Peltarion Platform enforces a hard 30 s timeout on requests and a maximum total request payload of 200 MB.
For large models, latency is dominated by the execution time of the model; for small models, the overhead of the request itself tends to dominate. The first time a model is invoked via the API, it is loaded into memory, which adds extra latency proportional to the size and complexity of the model. Once loaded, the model stays in memory for a while, so subsequent requests are quicker. If the model is not used for some time, it may be unloaded, and the next request will incur the loading latency again.
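The difference between the first (cold) request and later (warm) requests can be observed by timing each call on the client side. The following is a minimal sketch, not part of the platform: `measure_latencies` and `FakeDeployment` are hypothetical names, and `FakeDeployment` only simulates a model that is slow on its first call. In practice you would pass in your own function that sends a request to the deployment.

```python
import time
from typing import Callable, List


def measure_latencies(predict: Callable[[], object], n_requests: int = 3) -> List[float]:
    """Call `predict` n_requests times; return each call's wall-clock latency in seconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict()
        latencies.append(time.perf_counter() - start)
    return latencies


class FakeDeployment:
    """Hypothetical stand-in for a deployed model; it simulates a load delay on first use."""

    def __init__(self, load_seconds: float = 0.2, run_seconds: float = 0.01) -> None:
        self.load_seconds = load_seconds
        self.run_seconds = run_seconds
        self.loaded = False

    def predict(self) -> None:
        if not self.loaded:
            # Simulated cost of loading the model into memory.
            time.sleep(self.load_seconds)
            self.loaded = True
        # Simulated cost of executing the model.
        time.sleep(self.run_seconds)


if __name__ == "__main__":
    for i, seconds in enumerate(measure_latencies(FakeDeployment().predict)):
        print(f"request {i}: {seconds:.3f} s")
```

If the first measurement is consistently much larger than the rest, sending one warm-up request before time-sensitive work may help, keeping in mind that the model may be unloaded again after a period of inactivity.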
When sending input as JSON, you can provide a batch of samples in a single request, and they will be processed together. The time to execute the batch is proportional to the number of samples provided, but the request overhead is amortized over the entire batch. Batching is therefore much more efficient when many samples need to be processed, especially for small models.
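As an illustration of batching, the sketch below splits a list of samples into JSON request bodies of a fixed batch size and rejects any body that exceeds a size limit. Several details are assumptions made for the example rather than facts from this document: the function name `batched_payloads`, the top-level `"rows"` key, the interpretation of 200 MB as 200,000,000 bytes, and the simplification of checking only the JSON body rather than the full request.

```python
import json
from typing import Any, Dict, Iterator, List

# Assumption: 200 MB is counted as 200 * 10^6 bytes.
MAX_PAYLOAD_BYTES = 200 * 1000 * 1000


def batched_payloads(
    samples: List[Dict[str, Any]],
    batch_size: int,
    max_bytes: int = MAX_PAYLOAD_BYTES,
) -> Iterator[str]:
    """Yield JSON request bodies, each holding at most `batch_size` samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # Assumption: the samples of a batch are listed under a "rows" key.
        body = json.dumps({"rows": batch})
        if len(body.encode("utf-8")) > max_bytes:
            raise ValueError(
                f"batch starting at sample {start} exceeds {max_bytes} bytes; "
                "use a smaller batch_size"
            )
        yield body


if __name__ == "__main__":
    samples = [{"x": i} for i in range(5)]
    for body in batched_payloads(samples, batch_size=2):
        print(body)
```

Each yielded body would then be sent as one request. Because execution time grows with the number of samples, the batch size also has to be small enough for the request to finish within the 30 s timeout.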
Input data is required to be in exactly the same format as the data used for training.
Example: If 28x28-pixel grayscale images were used for training, then 28x28-pixel grayscale images need to be posted.
Example: If a [1000, 10, 10, 3] npy tensor was used for training (i.e. 1000 examples of [10, 10, 3] tensors), then a [1, 10, 10, 3] npy tensor of the same data type needs to be posted.
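The tensor rule in the last example can be checked on the client before posting. The sketch below is only an illustration: the function names are hypothetical, and shapes and data types are passed as plain tuples and strings so that the check does not depend on any particular library.

```python
from typing import Sequence, Tuple


def expected_request_shape(training_shape: Sequence[int]) -> Tuple[int, ...]:
    """Return the training shape with its first (example) axis replaced by 1."""
    return (1, *training_shape[1:])


def validate_request(
    training_shape: Sequence[int],
    training_dtype: str,
    request_shape: Sequence[int],
    request_dtype: str,
) -> None:
    """Raise ValueError if the request tensor does not match the training data."""
    expected = expected_request_shape(training_shape)
    if tuple(request_shape) != expected:
        raise ValueError(f"expected shape {expected}, got {tuple(request_shape)}")
    if request_dtype != training_dtype:
        raise ValueError(f"expected dtype {training_dtype}, got {request_dtype}")


if __name__ == "__main__":
    # Matches the example above: trained on [1000, 10, 10, 3], post [1, 10, 10, 3].
    validate_request((1000, 10, 10, 3), "float32", (1, 10, 10, 3), "float32")
    print(expected_request_shape((1000, 10, 10, 3)))
```

With NumPy arrays, `array.shape` and `str(array.dtype)` could be passed as the shape and data type arguments.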