Requirements of imported files
Csv file specifications
The csv (comma-separated values) file format is a simple text file format that can be viewed and edited easily by many programs and libraries.
The first line of the file gives the name of features, separated by commas. Each following line gives the values of each feature for a single example, separated by commas.
A csv file can be used to upload every type of feature that the platform supports:
Single numeric values.
Categorical values: a text string or number that is identical for all the examples of the dataset belonging to the same category.
Text values: a string of text in natural language.
If the text feature contains a comma, enclose the whole text in double quotes to avoid splitting the text across several features. If the text feature contains double quotes, replace each of them with two double quotes. For example:
"He said ""This is a text feature"", and pressed upload."
Images: specify the name (and path) of images and zip them together with the csv file.
Multi-dimensional numeric values (i.e. tensors): specify the name (and path) of npy files and zip them together with the csv file.
Save your csv files using the UTF-8 encoding.
Files saved with different encoding might work on the platform, but special characters may cause incompatibilities.
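The quoting rule for text features and the UTF-8 recommendation can be sketched with Python's standard csv module, which applies the comma and double-quote escaping automatically. The file name, feature names, and rows below are hypothetical and only serve as an illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows: a numeric id and a free-text feature.
rows = [
    {"id": 1, "comment": 'He said "This is a text feature", and pressed upload.'},
    {"id": 2, "comment": "Café menu"},
]

# Hypothetical output location in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")

# newline="" lets the csv module control line endings; encoding="utf-8"
# keeps special characters such as "é" intact.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "comment"])
    writer.writeheader()
    writer.writerows(rows)

# Read the raw text back to see how the fields were quoted.
with open(path, encoding="utf-8") as f:
    raw_lines = f.read().splitlines()

print(raw_lines[1])
```

The second line of the resulting file is the quoted form shown in the example above: the text is wrapped in double quotes and each inner double quote is doubled.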
Use fewer than 50000 characters in a single feature.
This matters most for text features, where the text can be arbitrarily long. Keep in mind that no model could handle such long features anyway.
The platform supports up to 2000 unique categories per feature. Datasets with more categories will not run properly.
Try to limit your datasets to fewer than 10 million rows. Beyond that, you will most likely hit very long training times, and it is rare that a model needs so many examples to train.
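Before uploading, the limits above can be checked locally. The following is a minimal sketch, not a platform tool: `check_csv_limits` is a hypothetical helper, and since it cannot know which features will be treated as categorical, it simply counts unique values in every column.

```python
import csv
import io
from collections import defaultdict

# Limits quoted from the text above.
MAX_CHARS_PER_FEATURE = 50_000
MAX_CATEGORIES_PER_FEATURE = 2_000
MAX_ROWS = 10_000_000


def check_csv_limits(csv_file):
    """Return a list of warnings for an open csv file object (hypothetical helper)."""
    # Raise the csv module's own per-field limit so long fields can be
    # measured here instead of raising an error while parsing.
    csv.field_size_limit(100 * MAX_CHARS_PER_FEATURE)

    too_long = set()
    unique_values = defaultdict(set)
    n_rows = 0

    for row in csv.DictReader(csv_file):
        n_rows += 1
        for name, value in row.items():
            if name is None or value is None:
                continue  # ragged row: extra or missing fields
            if len(value) >= MAX_CHARS_PER_FEATURE:
                too_long.add(name)
            # Stop collecting once a column is already over the limit.
            if len(unique_values[name]) <= MAX_CATEGORIES_PER_FEATURE:
                unique_values[name].add(value)

    warnings = []
    for name in sorted(too_long):
        warnings.append(f"{name}: a value has {MAX_CHARS_PER_FEATURE} characters or more")
    for name, values in sorted(unique_values.items()):
        if len(values) > MAX_CATEGORIES_PER_FEATURE:
            warnings.append(f"{name}: more than {MAX_CATEGORIES_PER_FEATURE} unique values")
    if n_rows >= MAX_ROWS:
        warnings.append(f"dataset has {n_rows} rows")
    return warnings


# Usage with a tiny in-memory file; a real file would be opened with
# open(path, newline="", encoding="utf-8").
sample = io.StringIO("label,text\ncat,hello\ndog,world\n")
print(check_csv_limits(sample))
```

A file within all three limits yields an empty list.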
The Platform supports images in the jpg/jpeg, png, and gif file formats.
Image size and color
There is no requirement on the resolution or color of the images that you upload, so that you can mix any combination of images in the same dataset.
Every example will be converted using the Transformation and Color mode options to make sure that your models are provided with the fixed size of data that they need.
Local image files
Image files stored online
Images stored on Google Cloud Storage or on Amazon S3 can be imported by the Platform if you specify their URL as a feature, as long as they are public.
See making data public for Google Cloud and Amazon S3.
Google Cloud Storage URLs start with
Amazon S3 URLs start with
Npy file specifications
The npy file is a binary file format that can save a NumPy array, where NumPy is the Python library that is commonly used in machine learning.
You can use npy files to upload a single feature, which can be a single numeric value or a multi-dimensional array of numeric values (i.e., a tensor).
Single npy file for all examples
If you upload a single npy file, the dataset’s examples are assumed to be arranged along the first dimension.
For example: if the npy array has the shape (1000, 20, 10, 3), then the platform will treat it as 1000 examples of a tensor feature with shape (20, 10, 3).
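As an illustration of this layout, the sketch below stacks per-example tensors along a new first dimension and reads the result back. The data is made up, and an in-memory buffer stands in for a real npy file.

```python
import io

import numpy as np

# Hypothetical data: 1000 examples, each a tensor of shape (20, 10, 3).
examples = [np.full((20, 10, 3), i, dtype=np.float32) for i in range(1000)]

# Stack along a new first axis so that axis 0 indexes the examples.
stacked = np.stack(examples, axis=0)

# Save to an in-memory buffer; np.save also accepts a file name.
buffer = io.BytesIO()
np.save(buffer, stacked)
buffer.seek(0)

loaded = np.load(buffer)
n_examples = loaded.shape[0]      # number of examples
feature_shape = loaded.shape[1:]  # shape of the tensor feature
print(n_examples, feature_shape)
```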
If your examples have several features, you can upload several npy files. They will be added following the multiple file upload rules.
One npy file per example in a zip file
If it’s more convenient, you can also upload one npy file per example. In that case, specify the individual npy file names as a feature in a csv file and upload everything in a zip file.
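A minimal sketch of this layout, assuming the csv lists paths relative to the root of the zip: each example is saved as its own npy file and a csv file lists the file names next to a label. The helper `build_npy_zip`, the `tensors/` folder, and the feature names `tensor` and `label` are hypothetical.

```python
import csv
import io
import zipfile

import numpy as np


def build_npy_zip(arrays, labels):
    """Return the bytes of a zip holding one npy file per example plus a csv."""
    zip_buffer = io.BytesIO()
    index = io.StringIO()
    writer = csv.writer(index, lineterminator="\n")
    writer.writerow(["tensor", "label"])  # hypothetical feature names

    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for i, (array, label) in enumerate(zip(arrays, labels)):
            name = f"tensors/example_{i}.npy"
            npy_bytes = io.BytesIO()
            # Store as C-ordered 32-bit floats.
            np.save(npy_bytes, np.ascontiguousarray(array, dtype=np.float32))
            archive.writestr(name, npy_bytes.getvalue())
            writer.writerow([name, label])
        archive.writestr("index.csv", index.getvalue().encode("utf-8"))

    return zip_buffer.getvalue()


# Usage: three made-up examples of shape (2, 3).
arrays = [np.arange(6).reshape(2, 3) for _ in range(3)]
zip_bytes = build_npy_zip(arrays, ["a", "b", "a"])
```

Writing `zip_bytes` to a file with `open(path, "wb")` gives an archive that bundles the csv with the npy files it references.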
The platform only supports 32-bit floating point numbers, because this is the standard format on GPUs and it makes calculations run faster.
If you work with Python and NumPy, it is likely that the default format is, instead, 64-bit floating point numbers. Make sure that you convert your values to 32-bit floats when you save your arrays to npy files:
import numpy as np
A = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
np.save('/my/folder/data.npy', A.astype('float32'))
The NumPy arrays must also be saved using little-endian byte order and C-ordering. These are likely correct by default, but if you get errors about byte order or Fortran order during upload, check byte swapping and the ascontiguousarray() function from NumPy.
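One way to satisfy all three requirements at once is sketched below. `to_platform_array` and `is_little_endian` are hypothetical helper names; the conversion relies on `np.ascontiguousarray` with an explicit little-endian float32 dtype.

```python
import sys

import numpy as np


def to_platform_array(array):
    """Return a float32, little-endian, C-ordered version of `array`."""
    # "<f4" is NumPy's explicit spelling of little-endian 32-bit float.
    return np.ascontiguousarray(array, dtype=np.dtype("<f4"))


def is_little_endian(array):
    """Check the byte order, treating native order ("=") per the host machine."""
    order = array.dtype.byteorder
    return order == "<" or (order == "=" and sys.byteorder == "little")


# A deliberately awkward input: big-endian float64 in Fortran order.
source = np.asfortranarray(np.arange(6, dtype=">f8").reshape(2, 3))
fixed = to_platform_array(source)

print(fixed.dtype, fixed.flags["C_CONTIGUOUS"], is_little_endian(fixed))
```

The values are unchanged by the conversion, apart from the usual loss of precision when going from 64-bit to 32-bit floats.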
Visualisation of npy files depends on the number of dimensions. Find out more in known issues.
Zip file specifications
The zip file is used to bundle together a combination of csv, image, and npy files.
If a csv file exists inside the zip, it is imported as a regular csv file. If some (or all) features in this file are paths to images (jpg or png) or npy files that exist inside the zip, these files will be imported as features.
If only some of the image files specified exist inside the zip, they will be imported and the missing images will create empty features. You will not be able to run an experiment that requires this feature, since the input should have the same format for all examples.
If several csv files exist inside the zip, you must name one of them index.csv and this file will be used. Otherwise import will fail.
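The csv rules above can be checked locally before uploading. The sketch below is not the platform's import logic: `pick_csv` and `find_missing_files` are hypothetical helpers, and it assumes that paths in the csv are relative to the root of the zip and that index.csv is recognised by its file name regardless of folder.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def pick_csv(names):
    """Choose the csv file that would be used, following the rules above."""
    csv_names = sorted(n for n in names if n.lower().endswith(".csv"))
    if not csv_names:
        return None
    if len(csv_names) == 1:
        return csv_names[0]
    for name in csv_names:
        if PurePosixPath(name).name == "index.csv":
            return name
    raise ValueError("several csv files found but none is named index.csv")


def find_missing_files(zip_file, column):
    """List the paths in `column` of the csv that are absent from the zip."""
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())
        csv_name = pick_csv(names)
        if csv_name is None:
            return []
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text)
            return [row[column] for row in reader if row[column] not in names]


# Usage with a small in-memory zip: the csv lists two images, one is absent.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("index.csv", "image,label\na.png,cat\nb.png,dog\n")
    archive.writestr("a.png", b"not a real image")
demo.seek(0)
missing = find_missing_files(demo, "image")
print(missing)
```

Any path reported here would end up as an empty feature after import.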
If the zip file does not contain any csv file, the platform will try to import all the image files that exist inside the zip as long as they are organized in a folder and subfolder structure. In that case, one or more categorical features will be created based on the folder and subfolder path.
If the zip file contains different folder structures, e.g. some images are located in sub-folders and other images are located in sub-sub-folders, then the platform will pick one of the folder structures and ignore files located in the other one.
The first image files found are located in the second level of subfolders. Thus, the categorical features category_0 and category_1 are automatically created based on the folder names.
However, the car images are located in only one level of subfolder. As a result, they are not imported in the dataset.
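To spot mixed folder structures before uploading, the image paths can be grouped by how many folder levels they sit in. This is a sketch only: the helper names and folder names are hypothetical, and the mapping of folder names to category_0, category_1 in path order is an assumption rather than documented behaviour.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Image formats listed as supported earlier in this page.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def group_images_by_depth(paths):
    """Group image paths by the number of folders above each file."""
    groups = defaultdict(list)
    for path in paths:
        parsed = PurePosixPath(path)
        if parsed.suffix.lower() in IMAGE_EXTENSIONS:
            groups[len(parsed.parts) - 1].append(path)
    return dict(groups)


def folder_categories(path):
    """Map folder names to category_0, category_1, ... (order is assumed)."""
    folders = PurePosixPath(path).parts[:-1]
    return {f"category_{i}": name for i, name in enumerate(folders)}


# Hypothetical zip listing, e.g. taken from zipfile.ZipFile(...).namelist().
paths = ["animals/cats/1.jpg", "animals/dogs/2.jpg", "cars/3.jpg"]
groups = group_images_by_depth(paths)
print(groups)
print(folder_categories("animals/cats/1.jpg"))
```

More than one key in `groups` means the zip mixes folder depths, so some images would be ignored on import.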