Requirements of imported files
You can upload different types of files into the Platform:
zip: A compressed zip file, containing one or more csv, image, or npy files.
images: The Platform supports images in the jpg/jpeg, png, and gif file formats.
npy: A saved NumPy array file, where different examples are listed along the first array dimension.
Our repository Sidekick helps you fix your data before you upload it. Sidekick’s aim is to make it easier to get data into the Platform.
Zip file specifications
A zip file bundles several files together so that they can be imported into the platform in a single upload.
Zip with index.csv including file paths
The zip file must include a file named index.csv.
index.csv must include paths to the files in the zip.
Example: MNIST.zip including index.csv, used in the MNIST tutorial. Download and inspect the structure if you want to.
If only some of the image files specified in index.csv exist inside the zip, the existing files will be imported and the missing images will create empty features. You will not be able to run an experiment that requires this feature, since the input must have the same format for all examples.
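As a rough illustration of the layout described above, the following sketch builds a zip that contains an index.csv pointing at the other files in the archive, and then lists any path in index.csv that is missing from the zip. The helper names and the column names "image" and "label" are assumptions made for this example, not names required by the platform.

```python
import csv
import io
import zipfile


def build_indexed_zip(zip_path, examples):
    """Write a zip with the given files plus an index.csv that lists them.

    `examples` is a list of (path_inside_zip, file_bytes, label) tuples.
    The column names "image" and "label" are illustrative only.
    """
    index = io.StringIO()
    writer = csv.writer(index, lineterminator="\n")
    writer.writerow(["image", "label"])
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for inner_path, data, label in examples:
            zf.writestr(inner_path, data)
            writer.writerow([inner_path, label])
        # Store the index as UTF-8 text at the top level of the zip.
        zf.writestr("index.csv", index.getvalue().encode("utf-8"))


def missing_files(zip_path, path_column="image"):
    """Return paths listed in index.csv that do not exist inside the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
        index_text = zf.read("index.csv").decode("utf-8")
    rows = csv.DictReader(io.StringIO(index_text))
    return [row[path_column] for row in rows if row[path_column] not in names]
```

Running missing_files before uploading is one way to catch the empty-feature situation early.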
Zip with no csv
If the zip file does not contain any csv file, you must organize the included files in a folder and subfolder structure.
During the import, one or more categorical features will be created based on the folder and subfolder path; see example below.
Same structure in zip
Put all files that belong to one category in one folder and name the folder after the category. After you’ve imported the zip, all files in a folder will get the folder name as their label.
Simple zip with MNIST images in a folder structure. Download and inspect the structure if you want to.
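A minimal sketch of how such a zip could be assembled in Python, assuming the files are already available in memory and grouped by category. The function name and the example folder names are hypothetical.

```python
import zipfile


def build_folder_zip(zip_path, files_by_label):
    """Write a zip where each top-level folder holds the files of one category.

    `files_by_label` maps a folder name (used as the label) to a dict of
    {filename: file_bytes}.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for label, files in files_by_label.items():
            for filename, data in files.items():
                # Every file sits exactly one folder deep: <label>/<filename>
                zf.writestr(f"{label}/{filename}", data)
```

For instance, build_folder_zip("digits.zip", {"zero": {...}, "one": {...}}) would produce the folders zero and one, each holding the files of that category.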
Different folder structures
If the zip file contains different folder structures, e.g., some images are located in subfolders and other images are located in sub-subfolders, the platform will pick one of the folder structures and ignore the files located in the other one.
The first image files found are located in the second level of subfolders. Thus, the categorical features category_0 and category_1 are automatically created based on the folder names.
However, the car images are located in only one level of subfolder. As a result, they are not imported into the dataset.
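To avoid losing files this way, it can help to check that every file in the zip sits at the same folder depth before uploading. The sketch below only inspects the zip locally; it does not reproduce how the platform chooses which structure to keep, and the function names are made up for this example.

```python
import zipfile
from collections import Counter


def folder_depths(zip_path):
    """Count the files found at each folder depth (0 means top level)."""
    with zipfile.ZipFile(zip_path) as zf:
        return Counter(
            name.count("/")
            for name in zf.namelist()
            if not name.endswith("/")  # skip explicit directory entries
        )


def has_mixed_structure(zip_path):
    """True if the files in the zip are spread over more than one depth."""
    return len(folder_depths(zip_path)) > 1
```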
Zip with only one csv
If a csv file exists inside the zip, it is imported as a regular csv file.
If some (or all) features in the csv are paths to images (jpg or png) or npy files that exist inside the zip, these files will be imported as features.
Csv file specifications
The csv (comma-separated values) file format is a simple text file format that can be viewed and edited easily by many programs and libraries.
The first line of the file gives the name of features, separated by commas. Each following line gives the values of each feature for a single example, separated by commas.
A csv file can be used to upload every type of feature that the platform supports:
Single numeric values.
Categorical values: a text string or number that is identical for all the examples of the dataset belonging to the same category.
Text values: a string of text in natural language.
If the text feature contains a comma, enclose the whole text between double quotes to avoid splitting the text across several features. If the text feature contains double quotes, replace them with two double-quotes.
"He said ""This is a text feature"", and pressed upload."
Images: specify the name (and path) of images and zip them together with the csv file.
Multi-dimensional numeric values (i.e. tensors): specify the name (and path) of npy files and zip them together with the csv file.
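The quoting rule for text features does not need to be applied by hand. As a small sketch, Python's standard csv module encloses a field in double quotes when it contains a comma or a quote, and doubles any quotes inside it. The column names "review" and "rating" are invented for this example.

```python
import csv
import io

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["review", "rating"])
writer.writerow(['He said "This is a text feature", and pressed upload.', 5])

csv_text = buffer.getvalue()
print(csv_text)
# review,rating
# "He said ""This is a text feature"", and pressed upload.",5

# Reading the text back gives the original, unescaped string.
rows = list(csv.reader(io.StringIO(csv_text)))
```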
Save your csv files using the UTF-8 encoding.
Files saved with different encoding might work on the platform, but special characters may cause incompatibilities.
Use fewer than 50000 characters in a single feature.
This applies particularly to text features, where the text could be arbitrarily long. Consider that no model could handle such long features anyway.
Be aware that the platform supports up to 2000 unique categories per feature. Datasets with more categories will not run properly.
Try to limit your datasets to fewer than 10 million rows. With more rows, you will most likely hit very long training times, and it is rare that a model needs so many examples to train.
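The size guidelines above can be checked locally before uploading. The following is a sketch of one possible check, not a tool provided by the platform. The function name and the way warnings are reported are assumptions, and you have to tell it which columns you intend to use as categorical features.

```python
import csv


def check_csv_limits(lines, categorical_columns=(), max_chars=50_000,
                     max_categories=2_000, max_rows=10_000_000):
    """Return a list of warnings for a csv given as a text file object.

    Open real files with encoding="utf-8" and newline="" before passing them in.
    Note that Python's csv module raises an error by itself for fields longer
    than its own limit (131072 characters by default).
    """
    reader = csv.DictReader(lines)
    too_long = set()
    categories = {name: set() for name in categorical_columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name, value in row.items():
            # Missing or surplus cells are not plain strings; skip them here.
            if isinstance(value, str) and len(value) >= max_chars:
                too_long.add(name)
        for name in categories:
            categories[name].add(row.get(name))

    warnings = []
    for name in sorted(too_long):
        warnings.append(f"feature {name!r} has a value with {max_chars} or more characters")
    for name, values in categories.items():
        if len(values) > max_categories:
            warnings.append(
                f"feature {name!r} has {len(values)} unique categories (limit {max_categories})"
            )
    if row_count >= max_rows:
        warnings.append(f"dataset has {row_count} rows (guideline: fewer than {max_rows})")
    return warnings
```

An empty list means none of the three guidelines was exceeded.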
Use an English-style decimal point (.), for example, Pi=3.14.
Some languages, for example, Spanish and Swedish, use a decimal comma (,) instead of a decimal point, for example, Pi=3,14. This might cause problems when you use a csv exported from a Spanish or Swedish version of Excel (or similar).
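If you already have a csv with decimal commas, one way to convert it is sketched below. It assumes that the export separates fields with semicolons (common when the comma is taken by the decimal separator) and that you know which columns are numeric. The function name is hypothetical, and thousands separators are not handled.

```python
import csv
import io


def convert_decimal_commas(src_text, numeric_columns, delimiter=";"):
    """Rewrite a csv so that the listed columns use a decimal point.

    The input is assumed to be `delimiter`-separated; the output is
    comma-separated. Values such as "1.234,5" (with thousands separators)
    are NOT handled by this sketch.
    """
    reader = csv.DictReader(io.StringIO(src_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for name in numeric_columns:
            row[name] = row[name].replace(",", ".")
        writer.writerow(row)
    return out.getvalue()


converted = convert_decimal_commas("name;value\npi;3,14\n", ["value"])
print(converted)
# name,value
# pi,3.14
```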
The platform supports images in the jpg/jpeg, png, and gif file formats.
Image size and color
There is no requirement on the resolution or color of the images that you upload, so you can mix any combination of images in the same dataset.
Every example will be converted using the Transformation and Color mode options to make sure that your models are provided with the fixed size of data that they need.
Local image files
Image files stored online
Images stored on Google Cloud Storage or on Amazon S3 can be imported by the Platform if you specify each image URL as a feature, as long as they are public.
See making data public for Google Cloud and Amazon S3.
Google Cloud Storage URLs start with
Amazon S3 URLs start with
Npy file specifications
The npy file is a binary file format that can save a NumPy array. NumPy is a Python library that is commonly used in machine learning.
You can use npy files to upload a single feature, which can be a single numeric value or a multi-dimensional array of numeric values (i.e., a tensor).
Single npy file for all examples
If you upload a single npy file, the dataset’s examples are assumed to be arranged along the first dimension.
If the npy array has the shape (1000, 20, 10, 3), then the platform will treat it as 1000 examples of a tensor feature with shape (20, 10, 3).
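The shape convention can be illustrated with a short sketch. Here 1000 made-up examples of shape (20, 10, 3) are stacked so that the example index ends up on the first axis. An in-memory buffer stands in for the npy file on disk.

```python
import io

import numpy as np

# 1000 hypothetical examples, each a tensor of shape (20, 10, 3).
examples = [np.full((20, 10, 3), i, dtype="float32") for i in range(1000)]

# np.stack adds a new first axis that indexes the examples.
data = np.stack(examples)

buffer = io.BytesIO()  # stands in for a file such as data.npy
np.save(buffer, data)
buffer.seek(0)
loaded = np.load(buffer)

n_examples = loaded.shape[0]      # number of examples
feature_shape = loaded.shape[1:]  # shape of the tensor feature
print(n_examples, feature_shape)
# 1000 (20, 10, 3)
```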
If your examples have several features, you can upload several npy files. They will be added following the multiple file upload rules.
One npy file per example in a zip file
The platform only supports 32-bit floating point numbers, because this is the standard format on GPUs and it makes calculations run faster.
If you work with Python and NumPy, it is likely that the default format is, instead, 64-bit floating point numbers. Make sure that you convert your values to 32-bit floats when you save your arrays to npy files:
import numpy as np
A = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
np.save('/my/folder/data.npy', A.astype('float32'))
The NumPy arrays must also be saved using little-endian byte order and C-ordering. These are likely correct by default, but if you get errors about byte order or Fortran order during upload, check byte swapping and the ascontiguousarray() function from NumPy.
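As a sketch of how the float32, byte order, and C-ordering requirements could be met in one step, np.ascontiguousarray accepts a dtype argument, and the dtype string "<f4" means little-endian 32-bit float. The two helper names below are invented for this example.

```python
import sys

import numpy as np


def to_platform_array(array):
    """Return a C-ordered, little-endian float32 version of `array`."""
    return np.ascontiguousarray(array, dtype="<f4")


def is_platform_ready(array):
    """Check float32 values, little-endian byte order, and C-ordering."""
    order = array.dtype.byteorder
    # "=" means native byte order, so it depends on the machine running this.
    little_endian = order == "<" or (order == "=" and sys.byteorder == "little")
    is_float32 = array.dtype.kind == "f" and array.dtype.itemsize == 4
    return bool(is_float32 and little_endian and array.flags["C_CONTIGUOUS"])


# A deliberately awkward input: big-endian float64 in Fortran order.
awkward = np.asfortranarray(np.arange(6, dtype=">f8").reshape(2, 3))
fixed = to_platform_array(awkward)
print(is_platform_ready(awkward), is_platform_ready(fixed))
# False True
```

The values themselves are unchanged by the conversion, apart from the loss of precision when going from 64-bit to 32-bit floats.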
Visualisation of npy files depends on the number of dimensions. Find out more in known issues.