Requirements of imported files

You can upload different types of files into the platform:

  • zip: A compressed zip file, containing one or more csv, image, or npy files.

  • csv: A comma-separated values text file. Separate values with commas.
    Use for tabular data.
    This is the easiest way to upload data with different kinds of encoding, e.g., text or integers.

  • images: The platform supports images in the jpg/jpeg, png, and gif file formats.

  • npy: A saved NumPy array file, where different examples are listed along the first array dimension.

Preprocessing help
Our repository, Sidekick, helps you fix your data before you upload it. Sidekick’s aim is to make it easier to get data into the platform.

Zip file specifications

A zip file bundles files together so that you can import them into the platform.

Zip with index.csv including file paths

The zip file must include a file named index.csv.
The index.csv must include paths to the files in the zip.

Example: MNIST.zip including index.csv used in the MNIST tutorial. Download and inspect the structure if you want to.

If only some of the image files listed in index.csv exist inside the zip, those files will be imported, and the missing images will create empty features. You will not be able to run an experiment that uses this feature, since the input must have the same format for all examples.

Example:

Zip file structure containing images and an index file.
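
As a minimal sketch of the layout described above, the following Python function writes a set of files and a matching index.csv into a zip. The function name, the column names image and label, and the placement of index.csv at the top level of the zip are assumptions made for illustration; the platform only requires that index.csv exists and lists paths to files in the zip.

```python
import csv
import io
import zipfile


def build_indexed_zip(zip_path, files, labels):
    """Write files plus an index.csv that lists their paths into a zip.

    files:  dict mapping path-inside-zip -> file content (bytes)
    labels: dict mapping path-inside-zip -> label (hypothetical feature)
    """
    # Refuse to write an index that points at files missing from the zip,
    # since missing files would create empty features on import.
    missing = sorted(set(labels) - set(files))
    if missing:
        raise ValueError(f"index.csv would reference missing files: {missing}")

    index = io.StringIO()
    writer = csv.writer(index, lineterminator="\n")
    writer.writerow(["image", "label"])  # hypothetical column names
    for path in sorted(labels):
        writer.writerow([path, labels[path]])

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.csv", index.getvalue())
        for path, content in files.items():
            archive.writestr(path, content)
```

Calling build_indexed_zip("images.zip", files, labels) with real image bytes produces a zip whose index.csv has one row per image path.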

Zip with no csv

If the zip file does not contain any csv file, you must organize the included files in a folder- and subfolder-structure.
During the import, one or more categorical features will be created based on the folder and subfolder path; see example below.

Same structure in zip

Put all files that belong to one category in one folder, and name the folder after the category. After you’ve imported the zip, every file in a folder gets the folder name as its label.

Figure 1. Same one-level structure in the zip

Example:
Simple zip with MNIST images in a folder structure. Download and inspect the structure if you want to.

Different folder structures

If the zip file contains different folder structures, e.g., some images are located in subfolders and other images are located in sub-subfolders, the platform will pick one of the folder structures and ignore the files located in the other one.

Example:
The first image files found are located in the second level of subfolders. Thus, the categorical features category_0 and category_1 are automatically created based on the folder names.
However, the car images are located in only one level of subfolder. As a result, they are not imported into the dataset.

Figure 2. Due to different folder structures, some images won’t be imported.
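
To catch mixed folder structures before uploading, you could count how many files sit at each folder depth. The sketch below is an illustration only and is not part of the platform; the function names are hypothetical. A zip whose files all share one depth reports a single entry.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def folder_depths(names):
    """Count files per folder depth (number of parent folders)."""
    depths = Counter()
    for name in names:
        if name.endswith("/"):  # skip explicit directory entries
            continue
        depths[len(PurePosixPath(name).parts) - 1] += 1
    return depths


def check_zip_structure(zip_path):
    """Return (is_consistent, depths) for the files in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        depths = folder_depths(archive.namelist())
    return len(depths) <= 1, dict(depths)


# Two images two folders deep and one image one folder deep: mixed structure.
example = ["animals/cat/1.png", "animals/dog/2.png", "car/3.png"]
print(dict(folder_depths(example)))
```

If more than one depth is reported, move the files so that every file has the same number of parent folders.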

Zip with only one csv

If a csv file exists inside the zip, it is imported as a regular csv file.

If some (or all) features in the csv are paths to images (jpg or png) or npy files that exist inside the zip, these files will be imported as features.

Csv file specifications

The csv (comma-separated values) file format is a simple text file format that can be viewed and edited easily by many programs and libraries.

Csv structure

The first line of the file gives the names of the features, separated by commas.
Each following line gives the values of each feature for a single example, separated by commas.

CSV example:

purchased,age,job,education
no,32,entrepreneur,university.degree
no,34,blue-collar,basic.4y
yes,27,technician,basic.9y

Csv usage

A csv file can be used to upload every type of feature that the platform supports:

  • Single numeric values.

  • Categorical values: a text string or number that is identical for all the examples of the dataset belonging to the same category.

  • Text values: a string of text in natural language.
    If the text feature contains a comma, enclose the whole text between double quotes to avoid splitting the text across several features. If the text feature contains double quotes, replace them with two double-quotes.
    For example: "He said ""This is a text feature"", and pressed upload."

  • Images: specify the name (and path) of images and zip them together with the csv file.

  • Multi-dimensional numeric values (i.e. tensors): specify the name (and path) of npy files and zip them together with the csv file.
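
The quoting rules for text values do not have to be applied by hand. As a small sketch, Python's standard csv module encloses a value in double quotes when it contains a comma or a double quote, and doubles any embedded double quotes. The feature names review and rating are made up for this example.

```python
import csv
import io

rows = [
    ["review", "rating"],  # hypothetical feature names
    ['He said "This is a text feature", and pressed upload.', 5],
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the original, unescaped value.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The second line of csv_text is the quoted form shown in the example above, followed by the rating.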

Text file in the csv format

Csv restrictions

  • Save your csv files using the UTF-8 encoding.
    Files saved with different encoding might work on the platform, but special characters may cause incompatibilities.

  • No rows with empty values are allowed. Use our DataCleaner tool to remove rows with empty values. Read more in the DataCleaner article.

  • Use fewer than 50000 characters in a single feature.
    This is particularly relevant if you upload a text feature, where the text could be arbitrarily long. Consider that no model could handle such long features anyway.

  • The platform supports up to 2000 unique categories per feature. Datasets with more categories per feature will not run properly.

  • Try to limit your datasets to fewer than 10 million rows. Larger datasets will most likely lead to very long training times, and it is rare that a model needs so many examples to train.

  • Use English style decimal point (.), for example, Pi=3.14.
    Some languages, for example, Spanish and Swedish, use a decimal comma (,) instead of a decimal point, for example, Pi=3,14. This might cause problems when you use a csv exported from a Spanish or Swedish version of Excel (or similar).
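
Several of these restrictions can be checked locally before you upload. The following is a rough sketch, not a tool provided by the platform: the function name and messages are hypothetical, and the check cannot know which features you intend to use as categorical, so the unique-value warning only matters for those.

```python
import csv
import io
import re
from collections import defaultdict

MAX_CHARS = 50_000      # per-value limit from the list above
MAX_CATEGORIES = 2_000  # unique values per categorical feature
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")  # e.g. 3,14


def check_csv(raw_bytes):
    """Return a list of human-readable problems found in a csv file."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]

    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    unique = defaultdict(set)
    for row_no, row in enumerate(reader, start=1):  # data rows, header excluded
        if len(row) != len(header):
            problems.append(f"row {row_no}: expected {len(header)} values, found {len(row)}")
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append(f"row {row_no}: empty value in {name!r}")
            elif len(value) >= MAX_CHARS:
                problems.append(f"row {row_no}: value in {name!r} is too long")
            elif DECIMAL_COMMA.match(value):
                problems.append(f"row {row_no}: decimal comma in {name!r}: {value}")
            unique[name].add(value)

    for name, values in unique.items():
        if len(values) > MAX_CATEGORIES:
            problems.append(f"{name!r}: {len(values)} unique values (a problem only if categorical)")
    return problems
```

For example, check_csv(open("data.csv", "rb").read()) returns an empty list when none of the checks fail; data.csv is a placeholder file name.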

DataCleaner tool to remove empty rows

We’ve created the DataCleaner tool to help you prepare data before you upload it to the Peltarion Platform. The tool removes rows in your tabular data that contain empty values.
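
If you prefer to do the same step in a script, the idea can be sketched in a few lines of Python. This is an illustration of the technique only, not the DataCleaner tool itself, and the function name is hypothetical. It also drops rows that have a different number of values than the header.

```python
import csv
import io


def drop_rows_with_empty_values(csv_text):
    """Return csv text without rows that contain an empty value."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, None)
    if header is None:  # nothing to clean in an empty file
        return ""

    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    for row in reader:
        # Keep the row only if it is complete and every value is non-blank.
        if len(row) == len(header) and all(value.strip() for value in row):
            writer.writerow(row)
    return output.getvalue()


cleaned = drop_rows_with_empty_values("a,b\n1,2\n3,\n,4\n5,6\n")
print(cleaned)
```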

Images specifications

The platform supports images in the jpg/jpeg, png, and gif file formats.

Image size and color

There is no requirement on the resolution or color of the images that you upload, so you can mix any combination of images in the same dataset.

Every example will be converted using the Transformation and Color mode options to make sure that your models are provided with the fixed size of data that they need.

  • The resolution of the images, i.e., the width and height in pixels, is controlled on the platform by the Transformation option.

  • The number of channels, i.e., grayscale or color, with or without transparency, is controlled by the Color mode option.

Local image files

Images that you have on your local computer need to be archived inside a zip file, either with an index file or without an index file.

Image files stored online

Images stored on Google Cloud Storage or on Amazon S3 can be imported by the platform if you specify each image URL as a feature, as long as the images are public.
See making data public for Google Cloud and Amazon S3.

  • Google Cloud Storage URLs start with gs://

  • Amazon S3 URLs start with s3://

You can provide the URLs either in an index csv file or in a data warehouse table, for example, Azure or BigQuery.

Example:

Importing images from Google Cloud Storage addresses.
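
An index csv with image URLs could be generated as in the sketch below. The bucket name, the feature names image_url and label, and the function name are placeholders. The check only verifies the URL prefix; it cannot tell whether the images are actually public.

```python
import csv
import io

ALLOWED_PREFIXES = ("gs://", "s3://")  # URL schemes named in the text above


def build_url_index(rows):
    """Return csv text with one (image URL, label) pair per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image_url", "label"])  # hypothetical feature names
    for url, label in rows:
        if not url.startswith(ALLOWED_PREFIXES):
            raise ValueError(f"unsupported image URL: {url}")
        writer.writerow([url, label])
    return buffer.getvalue()


index_csv = build_url_index([
    ("gs://my-public-bucket/cats/001.jpg", "cat"),  # hypothetical bucket
    ("s3://my-public-bucket/dogs/001.jpg", "dog"),
])
print(index_csv)
```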

Npy file specifications

The npy file is a binary file format that can save a NumPy array, where NumPy is the Python library that is commonly used in machine learning.

You can use npy files to upload a single feature, which can be a single numeric value or a multi-dimensional array of numeric values (i.e., a tensor).

Single npy file for all examples

If you upload a single npy file, the dataset’s examples are assumed to be arranged along the first dimension.

Example:
If the npy array has the shape (1000, 20, 10, 3), then the platform will treat it as 1000 examples of a tensor feature with shape (20, 10, 3).

If your examples have several features, you can upload several npy files. They will be added following the multiple file upload rules.
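
As a sketch of the single-file layout, the snippet below stacks 1000 placeholder examples of shape (20, 10, 3) along a new first dimension and saves the result. It writes to an in-memory buffer so that it has no side effects; in practice you would pass a file path such as the hypothetical features.npy to np.save.

```python
import io

import numpy as np

# 1000 placeholder examples, each a tensor of shape (20, 10, 3).
examples = [np.zeros((20, 10, 3)) for _ in range(1000)]

# Stack along a new first dimension and convert to 32-bit floats.
data = np.stack(examples).astype(np.float32)

buffer = io.BytesIO()  # stand-in for a path such as 'features.npy'
np.save(buffer, data)

# Load the array back to confirm what was written.
buffer.seek(0)
loaded = np.load(buffer)
print(loaded.shape, loaded.dtype)
```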

One npy file per example in a zip file

If it’s more convenient, you can also upload one npy file per example. In that case, specify the individual npy file names as a feature in a csv file and upload everything in a zip file.
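
The per-example layout could be produced as follows. This is a sketch under assumptions: the folder name tensors, the file naming scheme, the feature names tensor and label, and the function name are all invented for illustration.

```python
import csv
import io
import zipfile

import numpy as np


def zip_npy_examples(zip_path, arrays, labels):
    """Save one npy file per example and list the files in index.csv."""
    index = io.StringIO()
    writer = csv.writer(index, lineterminator="\n")
    writer.writerow(["tensor", "label"])  # hypothetical feature names

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for i, (array, label) in enumerate(zip(arrays, labels)):
            name = f"tensors/example_{i:05d}.npy"
            payload = io.BytesIO()
            # Convert to 32-bit floats before saving.
            np.save(payload, np.asarray(array, dtype=np.float32))
            archive.writestr(name, payload.getvalue())
            writer.writerow([name, label])
        archive.writestr("index.csv", index.getvalue())
```

Each row of index.csv then pairs the path of one npy file with the other feature values of that example.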

Npy restrictions

  • The platform only supports 32-bit floating point numbers, because this is the standard format on GPUs and it makes calculations run faster.
    If you work with Python and NumPy, it is likely that the default format is, instead, 64-bit floating point numbers. Make sure that you convert your values to 32-bit floats when you save your arrays to npy files:

    import numpy as np
    A = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
    np.save('/my/folder/data.npy', A.astype('float32'))

  • The NumPy arrays must also be saved using little-endian byte order and C-ordering. These are likely correct by default, but if you get errors about byte order or Fortran order during upload, check byte swapping and the ascontiguousarray() function from NumPy.

  • Visualization of npy files depends on the number of dimensions. Find out more in known issues.
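
The three format requirements above (32-bit floats, little-endian byte order, C-ordering) can be enforced in one step with np.ascontiguousarray and an explicit dtype. The helper name below is hypothetical, and the example array is deliberately created as big-endian, 64-bit, and Fortran-ordered to show the conversion.

```python
import numpy as np


def to_platform_array(array):
    """Return the data as a little-endian float32 array in C order."""
    return np.ascontiguousarray(array, dtype="<f4")


# A deliberately "wrong" array: big-endian, 64-bit floats, Fortran order.
awkward = np.asfortranarray(np.arange(6).reshape(2, 3).astype(">f8"))
fixed = to_platform_array(awkward)

print(fixed.dtype, fixed.flags["C_CONTIGUOUS"])
```

The values are unchanged; only the in-memory representation differs, so the result can be passed directly to np.save.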
