Requirements of imported files

Csv file specifications

The csv (comma-separated values) file format is a simple text file format that can be viewed and edited easily by many programs and libraries.
The first line of the file gives the names of the features, separated by commas. Each following line gives the values of each feature for a single example, separated by commas.

A csv file can be used to upload every type of feature that the platform supports:

  • Single numeric values.

  • Categorical values: a text string or number that is identical for all the examples of the dataset belonging to the same category.

  • Text values: a string of text in natural language.
    If the text feature contains a comma, enclose the whole text in double quotes to avoid splitting the text across several features. If the text feature contains double quotes, replace each of them with two double quotes.
    For example: "He said ""This is a text feature"", and pressed upload."

  • Images: specify the name (and path) of images and zip them together with the csv file.

  • Multi-dimensional numeric values (i.e. tensors): specify the name (and path) of npy files and zip them together with the csv file.
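
The quoting rule for text features can be left to a csv library rather than applied by hand. The following is a minimal sketch using Python's standard csv module; the feature names (price, category, review) and the values are hypothetical and only serve as an illustration.

```python
import csv
import io

# Hypothetical features: one numeric, one categorical, one text feature.
rows = [
    {"price": 12.5, "category": "book",
     "review": 'He said "This is a text feature", and pressed upload.'},
    {"price": 3.0, "category": "toy", "review": "Short and simple"},
]

buffer = io.StringIO()  # stand-in for a file opened with encoding="utf-8", newline=""
writer = csv.DictWriter(
    buffer, fieldnames=["price", "category", "review"], lineterminator="\n"
)
writer.writeheader()    # first line: feature names
writer.writerows(rows)  # one line per example

csv_text = buffer.getvalue()
print(csv_text)
```

With the default settings, the writer only quotes the values that need it, so the first review is enclosed in double quotes with its inner quotes doubled, while the second review is written as is.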

Text file in the csv format

Csv restrictions

  • Save your csv files using the UTF-8 encoding.
    Files saved with different encoding might work on the platform, but special characters may cause incompatibilities.

  • Use fewer than 50,000 characters in a single feature.
    This is particularly relevant if you upload a text feature, where the text could be arbitrarily long. Consider that no model could handle such long features anyway.

  • Be aware that the platform supports up to 2,000 unique categories per feature. Datasets with more categories will not run properly.

  • Try to limit your datasets to fewer than 10 million rows. Larger datasets will most likely lead to very long training times, and a model rarely needs that many examples to train.
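
Some of these restrictions can be checked locally before uploading. The sketch below is not a platform tool; check_csv is a hypothetical helper that flags features with 50,000 characters or more and categorical features with more than 2,000 unique values, and the sample data is made up.

```python
import csv
import io

MAX_CHARS = 50_000      # a single feature should stay below this length
MAX_CATEGORIES = 2_000  # maximum number of unique categories per feature

def check_csv(stream, categorical=()):
    """Return a list of problems found in a csv text stream (hypothetical helper)."""
    problems = []
    unique = {name: set() for name in categorical}
    reader = csv.DictReader(stream)
    for record, row in enumerate(reader, start=1):
        for name, value in row.items():
            # Skip missing values and surplus columns, which are not strings.
            if isinstance(value, str) and len(value) >= MAX_CHARS:
                problems.append(f"example {record}: '{name}' has {len(value)} characters")
        for name in categorical:
            unique[name].add(row[name])
    for name, values in unique.items():
        if len(values) > MAX_CATEGORIES:
            problems.append(f"'{name}' has {len(values)} unique categories")
    return problems

# Made-up sample: the second example has an overly long text feature.
sample = "id,label,text\n1,a,hello\n2,b," + "x" * 50_000 + "\n"
print(check_csv(io.StringIO(sample), categorical=["label"]))
```

For a file on disk, opening it with open(path, encoding="utf-8", newline="") and passing the file object to check_csv also covers the encoding restriction, since Python raises a UnicodeDecodeError while reading if the file is not valid UTF-8.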

Images specifications

The Platform supports images in the jpg/jpeg, png, and gif file formats.

Image size and color

There is no requirement on the resolution or color of the images that you upload, so that you can mix any combination of images in the same dataset.

Every example will be converted using the Transformation and Color mode options to make sure that your models are provided with the fixed size of data that they need.

  • The resolution of the images, i.e., the width and height in pixels, is controlled on the platform by the Transformation option.

  • The number of channels, i.e., grayscale or color, with or without transparency, is controlled by the Color mode option.

Local image files

Images that you have on your local computer need to be archived inside a zip file, either with an index file or without an index file.

Image files stored online

Images stored on Google Cloud Storage or on Amazon S3 can be imported by the Platform if you specify their URL as a feature, as long as they are public.
See making data public for Google Cloud and Amazon S3.

  • Google Cloud Storage URLs start with gs://

  • Amazon S3 URLs start with s3://

This can be done with either an index csv file, or inside a data warehouse table.
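
As an illustration, an index csv file that lists image URLs could be generated as follows. This is a minimal sketch: the bucket name, the object paths, the feature names (image, label), and the build_index_csv helper are all hypothetical.

```python
import csv
import io

SUPPORTED_PREFIXES = ("gs://", "s3://")

def build_index_csv(rows):
    """Return csv text with one image URL and one label per example."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image", "label"])  # hypothetical feature names
    for url, label in rows:
        # Reject URLs that are neither Google Cloud Storage nor Amazon S3.
        if not url.startswith(SUPPORTED_PREFIXES):
            raise ValueError(f"unsupported image URL: {url}")
        writer.writerow([url, label])
    return buffer.getvalue()

# Hypothetical public objects.
rows = [
    ("gs://example-bucket/images/cat_001.jpg", "cat"),
    ("s3://example-bucket/images/dog_001.png", "dog"),
]
index_csv = build_index_csv(rows)
print(index_csv)
```

The prefix check only catches malformed URLs. It cannot tell whether the objects are actually public, which still has to be configured in the storage service.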


Importing images from Google Cloud Storage addresses.

Npy file specifications

The npy file is a binary file format that can save a NumPy array, where NumPy is the Python library that is commonly used in machine learning.

You can use npy files to upload a single feature, which can be a single numeric value or a multi-dimensional array of numeric values (i.e., a tensor).

Single npy file for all examples

If you upload a single npy file, the dataset’s examples are assumed to be arranged along the first dimension.
For example: if the npy array has the shape (1000, 20, 10, 3), then the platform will treat it as 1000 examples of a tensor feature with shape (20, 10, 3).

If your examples have several features, you can upload several npy files. They will be added following the multiple file upload rules.
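
The layout described above can be sketched with NumPy as follows. The file names and the second feature (targets) are hypothetical, and the check that both arrays hold the same number of examples is an assumption about what the multiple file upload rules require.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# 1000 examples, each a tensor of shape (20, 10, 3).
tensors = rng.random((1000, 20, 10, 3)).astype("float32")
# A second, hypothetical feature: one numeric value per example.
targets = rng.random(1000).astype("float32")

# Assumption: every file describes the same number of examples along axis 0.
assert tensors.shape[0] == targets.shape[0]

with tempfile.TemporaryDirectory() as folder:
    tensor_path = os.path.join(folder, "tensors.npy")  # hypothetical file names
    target_path = os.path.join(folder, "targets.npy")
    np.save(tensor_path, tensors)
    np.save(target_path, targets)
    reloaded = np.load(tensor_path)

# The first dimension counts the examples; the rest is the feature shape.
n_examples, feature_shape = reloaded.shape[0], reloaded.shape[1:]
print(n_examples, feature_shape)
```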

One npy file per example in a zip file

If it’s more convenient, you can also upload one npy file per example. In that case, specify the individual npy file names as a feature in a csv file and upload everything in a zip file.
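
A zip file of this kind could be assembled as in the sketch below. The helper build_npy_zip, the folder name tensors, the feature names, and the choice of index.csv as the csv file name are all illustrative assumptions.

```python
import csv
import io
import zipfile

import numpy as np

def build_npy_zip(arrays, labels):
    """Return the bytes of a zip holding a csv file plus one npy file per example."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        index = io.StringIO()
        writer = csv.writer(index, lineterminator="\n")
        writer.writerow(["tensor", "label"])  # hypothetical feature names
        for i, (array, label) in enumerate(zip(arrays, labels)):
            name = f"tensors/example_{i:04d}.npy"  # path relative to the zip root
            payload = io.BytesIO()
            # Save as little-endian, C-ordered 32-bit floats.
            np.save(payload, np.ascontiguousarray(array, dtype="<f4"))
            zf.writestr(name, payload.getvalue())
            writer.writerow([name, label])
        zf.writestr("index.csv", index.getvalue())
    return archive.getvalue()

# Three made-up examples, each a tensor of shape (20, 10, 3).
arrays = [np.full((20, 10, 3), float(i)) for i in range(3)]
zip_bytes = build_npy_zip(arrays, ["a", "b", "a"])

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()
print(names)
```

To produce a file for upload, write zip_bytes to disk, or pass a file path to zipfile.ZipFile instead of the in-memory buffer.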

Npy restrictions

  • The platform only supports 32-bit floating point numbers, because this is the standard format on GPUs and it makes calculations run faster.
    If you work with Python and NumPy, it is likely that the default format is, instead, 64-bit floating point numbers. Make sure that you convert your values to 32-bit floats when you save your arrays to npy files:

    import numpy as np
    A = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
    # Convert to 32-bit floats before saving to the npy file
    np.save('/my/folder/data.npy', A.astype('float32'))

  • The NumPy arrays must also be saved using little-endian byte order and C-ordering. These are likely correct by default, but if you get errors about byte order or Fortran order during upload, check byte swapping and the ascontiguousarray() function from NumPy.

  • Visualisation of npy files depends on the number of dimensions. Find out more in known issues.
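
The number format, byte order, and memory layout can all be normalized in one step before saving. The following is a minimal sketch; to_platform_array is a hypothetical helper name, and the input array is deliberately built with the wrong format to show the conversion.

```python
import numpy as np

def to_platform_array(array):
    """Return a little-endian, C-ordered, 32-bit float version of the array."""
    # "<f4" means little-endian 32-bit float; ascontiguousarray enforces C order.
    return np.ascontiguousarray(array, dtype="<f4")

# A deliberately problematic array: big-endian 64-bit floats in Fortran order.
values = np.arange(6, dtype="float64").reshape(2, 3)
problematic = np.asfortranarray(values.astype(">f8"))

fixed = to_platform_array(problematic)
print(fixed.dtype.str, fixed.flags["C_CONTIGUOUS"])
```

The values themselves are unchanged, apart from the loss of precision that comes with going from 64-bit to 32-bit floats. The result can be passed directly to np.save.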

Zip file specifications

The zip file is used to bundle together a combination of csv, image, and npy files.

  • If a csv file exists inside the zip, it is imported as a regular csv file. If some (or all) features in this file are paths to images (jpg or png) or npy files that exist inside the zip, these files will be imported as features.

    If only some of the image files specified exist inside the zip, they will be imported and the missing images will create empty features. You will not be able to run an experiment that requires this feature, since the input should have the same format for all examples.

  • If several csv files exist inside the zip, you must name one of them index.csv and this file will be used. Otherwise, the import will fail.


Zip file structure containing images and an index file.

  • If the zip file does not contain any csv file, the platform will try to import all the image files that exist inside the zip as long as they are organized in a folder and subfolder structure. In that case, one or more categorical features will be created based on the folder and subfolder path.

    If the zip file contains different folder structures, e.g. some images are located in sub-folders and other images are located in sub-sub-folders, then the platform will pick one of the folder structures and ignore files located in the other one.


Zip file containing images but no index.

The first image files found are located in the second level of subfolders. Thus, the categorical features category_0 and category_1 are automatically created based on the folder names.

However, the car images are located in only one level of subfolder. As a result, they are not imported in the dataset.
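
Since files in a mismatched folder structure are ignored, it can help to check the folder depth of every image before uploading. The sketch below does not reproduce how the platform picks a structure; images_by_depth is a hypothetical helper that only reports which depths are present, and the folder and file names are made up.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}

def images_by_depth(zip_file):
    """Group the image paths in a zip by their number of parent folders."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            # Skip directory entries and files that are not images.
            if name.endswith("/") or path.suffix.lower() not in IMAGE_SUFFIXES:
                continue
            groups[len(path.parts) - 1].append(name)
    return dict(groups)

# Build a small in-memory zip with mixed folder depths (made-up names).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    for name in ["animals/cats/a.jpg", "animals/dogs/b.jpg", "cars/c.jpg"]:
        zf.writestr(name, b"")  # empty placeholder instead of real image data

depths = images_by_depth(archive)
print(depths)
```

If the result contains more than one depth, as it does here, some of the images sit in a different folder structure and risk being left out of the dataset.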