The abstract `MultipartUploader` class is the original class to upload a file in multiple parts to Hadoop-supported filesystems. The benefit of a multipart upload is that a file can be uploaded from multiple clients or processes in parallel, with the results not visible to other clients until the `complete` operation is called.
When implemented by an object store, uploaded data may incur storage charges even before it is visible in the filesystem. Users of this API must be diligent and always make a best-effort attempt to complete or abort the upload.
All the requirements of a valid `MultipartUploader` are considered implicit preconditions and postconditions: all operations on a valid `MultipartUploader` MUST result in a new `MultipartUploader` that is also valid.
The operations of a single multipart upload may take place across different instances of a multipart uploader, across different processes and hosts. It is therefore a requirement that:
1. All state needed to upload a part, complete an upload or abort an upload MUST be contained within, or be retrievable from, an upload handle.
2. If an upload handle is marshalled to another process, then, provided the receiving process has the correct permissions, it may participate in the upload: uploading one or more parts, completing the upload, and/or aborting it.
3. Multiple processes may upload parts of a multipart upload simultaneously.
If a call is made to `initialize(path)` for a destination where an active upload is in progress, implementations MUST perform one of two operations:

* Reject the call as a duplicate.
* Permit both uploads to proceed, with the final output of the file being that of the last upload to complete.
Which upload succeeds is undefined. Users must not expect consistent behavior across filesystems, across filesystem instances, or even across different requests.
If a multipart upload is completed or aborted while a part upload is in progress, the in-progress upload, if it has not completed, MUST NOT be included in the final file, in whole or in part. Implementations SHOULD raise an error in the `putPart()` operation.
A filesystem which supports multipart uploads extends the existing model `(Directories, Files, Symlinks)` to one of `(Directories, Files, Symlinks, Uploads)`, with `Uploads` of type `Map[UploadHandle -> Map[PartHandle -> UploadPart]]`.
The Uploads element of the state tuple is a map of all active uploads.
```
Uploads: Map[UploadHandle -> Map[PartHandle -> UploadPart]]
```
An UploadHandle is a non-empty list of bytes.
```
UploadHandle: List[byte]
len(UploadHandle) > 0
```
Clients MUST treat this as opaque. What is core to this feature's design is that the handle is valid across clients: the handle may be serialized on host `hostA`, deserialized on `hostB`, and still used to extend or complete the upload.
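Because handles are just bytes, marshalling them between hosts needs nothing more than copying those bytes. The helper below is a minimal sketch; the class and method names are hypothetical, and only the opaque, non-empty byte-sequence contract comes from this specification.

```java
import java.nio.ByteBuffer;

// Hypothetical marshalling helpers: the specification only guarantees
// that a handle is an opaque, non-empty byte sequence which stays
// valid when moved between hosts.
public final class HandleMarshalling {

  /** On hostA: copy the handle's opaque bytes so they can be sent elsewhere. */
  public static byte[] marshal(ByteBuffer handleBytes) {
    byte[] raw = new byte[handleBytes.remaining()];
    handleBytes.duplicate().get(raw);   // read without moving the caller's position
    return raw;
  }

  /** On hostB: rebuild the byte sequence; the uploader for the target
   *  filesystem turns these bytes back into a usable handle. */
  public static ByteBuffer unmarshal(byte[] raw) {
    if (raw.length == 0) {
      throw new IllegalArgumentException("empty handle");  // len(handle) > 0
    }
    return ByteBuffer.wrap(raw);
  }
}
```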
An upload is then a tuple of a destination path and a map of parts:

```
UploadPart = (Path: path, parts: Map[PartHandle -> byte[]])
```
Similarly, the `PartHandle` type is also a non-empty list of opaque bytes, again marshallable between hosts.
```
PartHandle: List[byte]
len(PartHandle) > 0
```
It is implicit that each UploadHandle in FS.Uploads is unique. Similarly, each PartHandle in the map of [PartHandle -> UploadPart] must also be unique.
`initialize(path)`: initializes a multipart upload, returning an upload handle for use in subsequent operations.
Preconditions:

```
if path == "/" : raise IOException

if exists(FS, path) and not isFile(FS, path) raise PathIsDirectoryException, IOException
```
If a filesystem does not support concurrent uploads to a destination, then the following precondition is added
```
if path in values(FS.Uploads) raise PathExistsException, IOException
```
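To make these preconditions concrete, here is a sketch of a client-side guard before initiating an upload. The `Uploader` interface and `startUpload` helper are hypothetical stand-ins, and the `initialize(Path)` shape is assumed from the operation name used in this document; only the checks themselves come from the preconditions above.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathIsDirectoryException;

public final class StartUploadSketch {

  /** Hypothetical stand-in for a filesystem's multipart uploader. */
  interface Uploader {
    byte[] initialize(Path dest) throws IOException;  // opaque handle bytes
  }

  static byte[] startUpload(FileSystem fs, Uploader uploader, Path dest)
      throws IOException {
    // Mirror the spec's preconditions; the store re-checks these itself.
    if (dest.isRoot()) {
      throw new IOException("cannot upload to the root path: " + dest);
    }
    if (fs.exists(dest) && fs.getFileStatus(dest).isDirectory()) {
      throw new PathIsDirectoryException(dest.toString());
    }
    return uploader.initialize(dest);
  }
}
```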
`putPart()`: upload a part for the multipart upload.

`complete()`: complete the multipart upload.
A filesystem MAY enforce a minimum size for each part, excluding the last part uploaded. If a part is outside this range, an `IOException` MUST be raised.
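As an illustration, a client can validate sizes before uploading rather than discover the failure afterwards. The 5 MB floor below is an assumption used only for this example (it matches Amazon S3's documented minimum part size); the real limit is store-specific and not defined by this specification.

```java
import java.io.IOException;

public final class PartSizeGuard {

  // Assumed floor for illustration only: the real minimum is
  // store-specific (Amazon S3, for instance, requires at least 5 MB
  // for every part except the last).
  private static final long ASSUMED_MIN_PART_SIZE = 5L * 1024 * 1024;

  /** Fail fast before putPart() rather than after uploading the bytes. */
  static void checkPartSize(long lengthInBytes, boolean isLastPart)
      throws IOException {
    if (!isLastPart && lengthInBytes < ASSUMED_MIN_PART_SIZE) {
      throw new IOException("part of " + lengthInBytes
          + " bytes is below the assumed minimum of "
          + ASSUMED_MIN_PART_SIZE);
    }
  }
}
```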
Preconditions:

```
uploadHandle in keys(FS.Uploads) else raise FileNotFoundException
FS.Uploads(uploadHandle).path == path
if exists(FS, path) and not isFile(FS, path) raise PathIsDirectoryException, IOException
parts.size() > 0
```
If there are handles in the multipart upload which aren't included in the map, then the omitted parts will not be part of the resulting file. It is up to the implementation of the `MultipartUploader` to make sure the leftover parts are cleaned up.
In the case of backing stores that support directories (local filesystem, HDFS, etc.), if, at the point of completion, there is now a directory at the destination, then a `PathIsDirectoryException` or other `IOException` MUST be thrown.
Postconditions:

```
UploadData' == ordered concatenation of all data in the map of parts, ordered by key
exists(FS', path) and result = PathHandle(path)
FS' = FS where FS'.Files(path) == UploadData' and not uploadHandle in keys(FS'.Uploads)
```
The `PathHandle` is returned by the `complete` operation so subsequent operations will be able to identify that the data has not changed in the meantime.
The order of parts in the uploaded file is that of the natural order of the parts in the map: part 1 precedes part 2, and so on.
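Tying the operations together, here is a sketch of a two-part upload. The `Uploader` interface is a hypothetical stand-in shaped after the calls named in this document (`initialize`, `putPart`, `complete`), not the real Hadoop API; real signatures are implementation-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.fs.Path;

public final class MultipartFlowSketch {

  /** Assumed operation shapes, named after the calls in this document. */
  interface Uploader {
    byte[] initialize(Path dest) throws IOException;
    byte[] putPart(Path dest, InputStream data, int partNumber,
        byte[] uploadHandle, long length) throws IOException;
    void complete(Path dest, Map<Integer, byte[]> partHandles,
        byte[] uploadHandle) throws IOException;
  }

  static void uploadInTwoParts(Uploader uploader, Path dest,
      byte[] first, byte[] second) throws IOException {
    byte[] upload = uploader.initialize(dest);

    // Parts may be uploaded in any order, even from different hosts;
    // here part 2 deliberately goes first.
    Map<Integer, byte[]> parts = new TreeMap<>();
    parts.put(2, uploader.putPart(dest, new ByteArrayInputStream(second), 2,
        upload, second.length));
    parts.put(1, uploader.putPart(dest, new ByteArrayInputStream(first), 1,
        upload, first.length));

    // The final file is the concatenation ordered by part number, so it
    // holds first's bytes followed by second's, regardless of upload order.
    // Any part handle omitted from this map is excluded from the file.
    uploader.complete(dest, parts, upload);
  }
}
```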