Requirements of imported files
You can upload different types of files into the Platform:
- zip: A compressed zip file, containing one or more csv, image, or npy files.
- csv: A comma-separated values text file, the easiest way to upload data with different kinds of encoding, e.g., text or integer. Use for tabular data.
- images: The Platform supports images in the jpg/jpeg, png, and gif file formats.
- npy: A saved NumPy array file, where different examples are listed along the first array dimension.
Preprocessing help
Our repository Sidekick helps you fix your data before you upload it. Sidekick’s aim is to make it easier to get data into the Platform.
Zip file specifications
A zip file is used to bundle files together so that they can be imported into the platform.
Zip with index.csv including file paths
The zip file must include a file named index.csv. The index.csv must include paths to the files in the zip.
Example: MNIST.zip including index.csv, used in the MNIST tutorial. Download and inspect the structure if you want to.
If only some of the image files specified exist inside the zip, they will be imported and the missing images will create empty features. You will not be able to run an experiment that requires this feature, since the input should have the same format for all examples.
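The layout described above can be sketched with Python's standard library. This is only an illustration: the column names image and label, the helper names, and the file paths are hypothetical and are not prescribed by the platform. The second helper lists paths that index.csv mentions but the zip does not contain, which is the situation that produces empty features.

```python
import csv
import io
import zipfile


def build_indexed_zip(files, labels):
    """Return the bytes of a zip holding the given files plus an index.csv.

    files:  dict mapping a path inside the zip to the file's bytes
    labels: dict mapping a path inside the zip to a label (hypothetical column)
    """
    index = io.StringIO()
    writer = csv.writer(index, lineterminator="\n")
    writer.writerow(["image", "label"])  # header row: feature names
    for path in sorted(labels):
        writer.writerow([path, labels[path]])

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.csv", index.getvalue())
        for path, data in files.items():
            archive.writestr(path, data)
    return buffer.getvalue()


def missing_from_zip(zip_bytes, path_column="image"):
    """List the paths named in index.csv that are not present in the zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
        index_text = archive.read("index.csv").decode("utf-8")
    rows = csv.DictReader(io.StringIO(index_text))
    return [row[path_column] for row in rows if row[path_column] not in names]
```

Running missing_from_zip before uploading is one way to catch missing files early; an empty list means every path in index.csv resolves to a file in the archive.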
Zip with no csv
If the zip file does not contain any csv file, you must organize the included files in a folder and subfolder structure.
During the import, one or more categorical features will be created based on the folder and subfolder path; see example below.
Same structure in zip
Put all files that belong to one category in one folder and name the folder after the category. After you've imported the zip, all files in a folder get the folder name as their label.

Example:
Simple zip with MNIST images in a folder structure. Download and inspect the structure if you want to.
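A minimal sketch of building such a zip in Python, assuming you already have the files grouped by category. The function name and the category and file names used with it are hypothetical.

```python
import io
import zipfile


def build_folder_zip(files_by_category):
    """Return zip bytes where each category's files sit in a folder named after it.

    files_by_category: dict mapping a category name to {file name: file bytes}
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for category, files in files_by_category.items():
            for name, data in files.items():
                # Paths inside a zip use forward slashes on every operating system.
                archive.writestr(f"{category}/{name}", data)
    return buffer.getvalue()
```

Every file ends up exactly one folder deep, so all files share the same structure.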
Different folder structures
If the zip file contains different folder structures, e.g., some images are located in subfolders and other images are located in sub-subfolders, the platform will pick one of the folder structures and ignore files located in the other one.
Example:
The first image files found are located in the second level of subfolders.
Thus, the categorical features category_0 and category_1 are automatically created based on the folder names.
However, the car images are located in only one level of subfolder.
As a result, they are not imported into the dataset.
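To spot this problem before uploading, you could group the files in a zip by folder depth. The sketch below is an assumption about how to check this locally, not a description of how the platform chooses a structure; the function name is hypothetical.

```python
import io
import zipfile
from collections import defaultdict


def files_by_depth(zip_bytes):
    """Group the files in a zip by how many folders deep they sit."""
    depths = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit folder entries
            depths[info.filename.count("/")].append(info.filename)
    return dict(depths)
```

If the returned dict has more than one key, the zip mixes folder depths and some files are likely to be ignored during import.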
Zip with only one csv
If a csv file exists inside the zip, it is imported as a regular csv file.
If some (or all) features in the csv are paths to images (jpg or png) or npy files that exist inside the zip, these files will be imported as features.
Csv file specifications
The csv (comma-separated values) file format is a simple text file format that can be viewed and edited easily by many programs and libraries.
Csv structure
The first line of the file gives the name of features, separated by commas. Each following line gives the values of each feature for a single example, separated by commas.
Csv usage
A csv file can be used to upload every type of feature that the platform supports:
- Single numeric values.
- Categorical values: a text string or number that is identical for all the examples of the dataset belonging to the same category.
- Text values: a string of text in natural language.
  If the text feature contains a comma, enclose the whole text between double quotes to avoid splitting the text across several features. If the text feature contains double quotes, replace them with two double quotes.
  For example: "He said ""This is a text feature"", and pressed upload."
- Images: specify the name (and path) of images and zip them together with the csv file.
- Multi-dimensional numeric values (i.e. tensors): specify the name (and path) of npy files and zip them together with the csv file.
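The quoting rules for text values do not have to be applied by hand. As a small sketch, Python's csv module applies them when writing; the column names id and review are made up for illustration.

```python
import csv
import io

rows = [
    ["id", "review"],
    [1, 'He said "This is a text feature", and pressed upload.'],
    [2, "No special characters here"],
]

buffer = io.StringIO()
# The default dialect quotes only fields that need it and doubles embedded quotes.
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back recovers the original string, commas and quotes included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Here the second line of csv_text holds the review wrapped in double quotes with its inner quotes doubled, while the third line is left unquoted.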
Csv restrictions
- Save your csv files using the UTF-8 encoding.
  Files saved with a different encoding might work on the platform, but special characters may cause incompatibilities.
- Use fewer than 50000 characters in a single feature.
  This matters most when you upload a text feature, where the text could be arbitrarily long. Consider that no model could handle such long features anyway.
- Be aware that the platform supports up to 2000 unique categories per feature. Datasets with more categories will not run properly.
- Try to limit your datasets to fewer than 10 million rows. With more rows you will most likely hit very long training times, and it is rare that a model needs so many examples to train.
- Use an English-style decimal point (.), for example, Pi=3.14.
  Some languages, for example, Spanish and Swedish, use a decimal comma (,) instead of a decimal point, for example, Pi=3,14. This might cause problems when you use a csv exported from a Spanish or Swedish version of Excel (or similar).
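Several of these restrictions can be checked locally before uploading. The following is a rough sketch using only the standard library; the function name, the warning texts, and the decimal-comma pattern are assumptions made for illustration. The unique-value check only matters for features you intend to use as categorical, so a warning for a numeric or text column can be ignored.

```python
import csv
import io
import re

MAX_CHARS = 50000       # a single feature value should stay below this length
MAX_CATEGORIES = 2000   # unique categories supported per feature
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")  # matches values such as 3,14


def check_csv(raw_bytes):
    """Return a list of human-readable warnings for a csv given as bytes."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    warnings = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    unique = {name: set() for name in header}
    for row_no, row in enumerate(reader, start=2):
        for name, value in zip(header, row):
            if len(value) >= MAX_CHARS:
                warnings.append(f"row {row_no}: '{name}' has {len(value)} characters")
            if DECIMAL_COMMA.match(value):
                warnings.append(f"row {row_no}: '{name}' looks like a decimal comma: {value}")
            unique[name].add(value)
    for name, values in unique.items():
        if len(values) > MAX_CATEGORIES:
            warnings.append(f"'{name}' has {len(values)} unique values")
    return warnings
```

An empty list means none of the checks fired; it does not guarantee that the import will succeed.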
Images specifications
The platform supports images in the jpg/jpeg, png, and gif file formats.
Image size and color
There is no requirement on the resolution or color of the images that you upload, so that you can mix any combination of images in the same dataset.
Every example will be converted using the Transformation and Color mode options to make sure that your models are provided with the fixed size of data that they need.
- The resolution of the images, i.e., the width and height in pixels, is controlled on the platform by the Transformation option.
- The number of channels, i.e., grayscale or color, with or without transparency, is controlled by the Color mode option.
Local image files
Images that you have on your local computer need to be archived inside a zip file, either with an index file or without an index file.
Image files stored online
Images stored on Google Cloud Storage or on Amazon S3 can be imported by the Platform if you specify each image URL as a feature, as long as they are public.
See making data public for Google Cloud and Amazon S3.
- Google Cloud Storage URLs start with gs://
- Amazon S3 URLs start with s3://
This can be done with either an index csv file, or inside a data warehouse table.
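As a sketch of the index csv route, the snippet below writes one image URL per row and rejects URLs that start with neither prefix. The bucket names, object names, column names, and the helper itself are hypothetical.

```python
import csv
import io

SUPPORTED_PREFIXES = ("gs://", "s3://")


def url_index_csv(rows, url_column="image"):
    """Build index csv text from dict rows, rejecting URLs with other prefixes."""
    for row in rows:
        if not row[url_column].startswith(SUPPORTED_PREFIXES):
            raise ValueError(f"unsupported image URL: {row[url_column]}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical bucket and object names, for illustration only.
example = url_index_csv([
    {"image": "gs://my-public-bucket/cats/001.jpg", "label": "cat"},
    {"image": "s3://my-public-bucket/dogs/001.jpg", "label": "dog"},
])
```

The check only looks at the URL prefix; it cannot tell whether the objects are actually public.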
Npy file specifications
The npy file is a binary file format that can save a NumPy array, where NumPy is the Python library that is commonly used in machine learning.
You can use npy files to upload a single feature, which can be a single numeric value or a multi-dimensional array of numeric values (i.e., a tensor).
Single npy file for all examples
If you upload a single npy file, the dataset’s examples are assumed to be arranged along the first dimension.
Example:
If the npy array has the shape (1000, 20, 10, 3), then the platform will treat it as 1000 examples of a tensor feature with shape (20, 10, 3).
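The same shape can be reproduced locally to see how the first dimension separates examples from the feature shape. This sketch uses an in-memory buffer in place of a real file and fills the array with zeros purely as placeholder data.

```python
import io

import numpy as np

# 1000 examples, each a tensor of shape (20, 10, 3), stored as 32-bit floats.
data = np.zeros((1000, 20, 10, 3), dtype=np.float32)

buffer = io.BytesIO()  # stands in for a file such as data.npy
np.save(buffer, data)
buffer.seek(0)
loaded = np.load(buffer)

n_examples = loaded.shape[0]      # first dimension: number of examples
feature_shape = loaded.shape[1:]  # remaining dimensions: shape of one example
```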
If your examples have several features, you can upload several npy files. They will be added following the multiple file upload rules.
One npy file per example in a zip file
Npy restrictions
- The platform only supports 32-bit floating point numbers, because this is the standard format on GPUs and it makes calculations run faster.
  If you work with Python and NumPy, it is likely that the default format is, instead, 64-bit floating point numbers. Make sure that you convert your values to 32-bit floats when you save your arrays to npy files:

      import numpy as np

      A = np.array([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
      # NumPy defaults to float64 here, so cast to float32 before saving.
      np.save('/my/folder/data.npy', A.astype('float32'))

- The NumPy arrays must also be saved using little-endian byte order and C-ordering. These are likely correct by default, but if you get errors about byte order or Fortran order during upload, check byte swapping and the ascontiguousarray() function from NumPy.
- Visualisation of npy files depends on the number of dimensions. Find out more in known issues.
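One possible way to satisfy the float32, little-endian, and C-ordering requirements in a single step is to pass an explicit dtype to ascontiguousarray(). The helper name below is hypothetical, and the sample array is deliberately built with the wrong layout to show the conversion.

```python
import numpy as np


def to_platform_array(values):
    """Return values as a C-ordered, little-endian, 32-bit float array."""
    return np.ascontiguousarray(values, dtype=np.dtype("<f4"))


# A Fortran-ordered, big-endian, 64-bit array as a deliberately awkward input.
awkward = np.array([[0, 1, 2], [3, 4, 5]], dtype=">f8", order="F")
fixed = to_platform_array(awkward)
```

The values are preserved while the memory layout, byte order, and precision change; the result can then be passed to np.save as usual.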