How to combine multiple S3 objects into a single target S3 object without leaving S3

+2 votes

I understand that the minimum part size for a multipart upload to an S3 bucket is 5 MB. Is there any way to change this on a per-bucket basis?

The reason I'm asking is that we have a list of raw objects in S3 that we want to combine into a single S3 object.

Using PUT part/copy we are able to "glue" objects into a single one, provided that all objects except the last one are >= 5 MB. However, sometimes our raw objects are not big enough, and in that case, when we try to complete the multipart upload, we get the famous error "Your proposed upload is smaller than the minimum allowed size" from AWS S3.

Any other ideas on how we could combine S3 objects without downloading them first?
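To make the constraint concrete, here is a minimal sketch of a pre-flight check. The function name `undersized_parts` is made up for illustration, and it assumes you already know the object sizes in bytes (for example from a listing). It only encodes the rule described above: every part except the last must be at least 5 MB (5 * 1024 * 1024 bytes).

```python
# S3's minimum size for every part of a multipart upload except the last one.
MIN_PART_SIZE = 5 * 1024 * 1024


def undersized_parts(sizes):
    """Return the indexes of parts that are too small to be a non-final part.

    `sizes` is a list of object sizes in bytes, in the order they would be
    glued together. The last entry is never reported, because the final
    part of a multipart upload may be smaller than the minimum.
    """
    return [i for i, size in enumerate(sizes[:-1]) if size < MIN_PART_SIZE]
```

An empty result means the objects could be glued in that order with part/copy alone; any index returned points at an object that would trigger the error quoted above.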

Aug 28, 2018 in AWS by bug_seeker
• 15,350 points
1,751 views

1 answer to this question.

+1 vote

"However sometimes our raw objects are not big enough... "

You can keep a 5 MB garbage object sitting on S3 and concatenate with it, where part 1 = the 5 MB garbage object and part 2 = the file that you want to concatenate. Keep repeating this for each fragment, using the result so far as part 1, and finally use a range copy to strip out the 5 MB of garbage.
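A rough sketch of how this could look in Python, not a tested production implementation. It assumes `s3` is a boto3 S3 client, that a padding object of exactly `PAD_SIZE` bytes already exists under `pad_key`, and that all objects live in one bucket. The helper names and the `.stepN` scratch keys are made up for illustration. Each round uses the result so far as part 1, so the garbage only ever sits at the front and one range copy removes it.

```python
PAD_SIZE = 5 * 1024 * 1024  # assumed size of the padding ("garbage") object


def copy_parts(s3, bucket, dest_key, sources):
    """Build dest_key server side from (key, byte_range_or_None) sources."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)["UploadId"]
    parts = []
    for number, (key, byte_range) in enumerate(sources, start=1):
        extra = {"CopySourceRange": byte_range} if byte_range else {}
        result = s3.upload_part_copy(
            Bucket=bucket, Key=dest_key, UploadId=upload_id, PartNumber=number,
            CopySource={"Bucket": bucket, "Key": key}, **extra)
        parts.append({"ETag": result["CopyPartResult"]["ETag"],
                      "PartNumber": number})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=dest_key, UploadId=upload_id,
        MultipartUpload={"Parts": parts})


def concat_small_objects(s3, bucket, pad_key, fragment_keys, dest_key):
    """Concatenate fragments behind a padding object, then strip the padding."""
    if not fragment_keys:
        raise ValueError("need at least one fragment")
    current = pad_key
    for step, key in enumerate(fragment_keys):
        # Part 1 is the padded result so far (always >= 5 MB); part 2 is the
        # next fragment, which may be small because it is the last part.
        scratch = f"{dest_key}.step{step}"  # hypothetical scratch key
        copy_parts(s3, bucket, scratch, [(current, None), (key, None)])
        current = scratch
    # A single-part upload has no minimum size, so a range copy that skips
    # the first PAD_SIZE bytes yields the clean result.
    size = s3.head_object(Bucket=bucket, Key=current)["ContentLength"]
    copy_parts(s3, bucket, dest_key, [(current, f"bytes={PAD_SIZE}-{size - 1}")])
```

Things this sketch leaves out: deleting the scratch objects, aborting uploads on failure, and splitting the copies into several ranges once the source grows past the per-part copy limit (5 GB).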

answered Aug 28, 2018 by Priyaj
• 56,520 points
@Priyaj, the idea seems great, but could you please clarify the following:

By concatenation, did you mean reading the file stream of the smaller chunk and the garbage object, concatenating them, and writing the result back? Do you think there is a better way of doing that than reading the whole stream?

Also, if I keep repeating this, there will be multiple chunks of garbage in between the pieces of the concatenated file. How do I remove them, given that the file has already been written to S3 with those garbage objects in it? Did you mean reading only the desired parts? In that case I would have to maintain those ranges (or a delimiter) to separate out each 5 MB of garbage. Is this what you are talking about? Please suggest.

I think the garbage-object approach seems like a lot of hard work, and there is also no simple way of deleting the garbage afterwards.

I think it's better to use a multipart upload. You first initiate the multipart upload and then upload the object part by part, assigning a unique number to each part. Once you've uploaded everything and completed the upload, S3 concatenates the parts in ascending order of part number.

But again, the minimum size of each part (except the last one) is 5 MB, so make sure you divide your object into parts accordingly.
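For reference, the initiate / upload parts / complete flow could be sketched like this. It assumes `s3` is a boto3 S3 client; the function name `multipart_upload` and the `numbered_chunks` argument are made up for illustration. Parts may be uploaded in any order, since the part number decides where each one lands in the final object.

```python
def multipart_upload(s3, bucket, key, numbered_chunks):
    """Upload (part_number, data) pairs and let S3 assemble them by number."""
    # Step 1: initiate the upload and remember its id.
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    try:
        # Step 2: upload each part under its own unique part number.
        for number, data in numbered_chunks:
            result = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                    PartNumber=number, Body=data)
            parts.append({"ETag": result["ETag"], "PartNumber": number})
        # Step 3: complete. The request must list parts in ascending order.
        parts.sort(key=lambda part: part["PartNumber"])
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                     MultipartUpload={"Parts": parts})
    except Exception:
        # Abort so that already-uploaded parts do not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return parts
```

Note that `upload_part` sends the bytes from the client, so this variant only helps when the data is already on your side; for objects that are already in S3, the part/copy call from the question is the server-side equivalent.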

Have a look at this: https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html

So I finally implemented it: I streamed all the files one by one and created temporary files of 5 MB each, then used copypartObject and a multipart upload to create the consolidated file from the 5 MB files, in order.

So if I originally have 3 files of 6 MB, 7 MB and 4 MB, I create 4 files of 5 + 5 + 5 + 2 MB and then use a multipart upload on these files.
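The re-chunking step could look roughly like the sketch below. This is not the code from the comment above, just one way to do it. The name `rechunk` is made up, and `streams` is assumed to be an iterable of byte-block iterables, one per source file (for example, blocks read from each S3 object body in turn).

```python
def rechunk(streams, part_size=5 * 1024 * 1024):
    """Re-slice consecutive byte streams into fixed-size parts.

    Yields parts of exactly `part_size` bytes, followed by one shorter
    final part if any bytes are left over. Only one part's worth of data
    (plus one incoming block) is buffered at a time.
    """
    buffer = bytearray()
    for stream in streams:
        for block in stream:
            buffer.extend(block)
            while len(buffer) >= part_size:
                yield bytes(buffer[:part_size])
                del buffer[:part_size]
    if buffer:
        # The last part is allowed to be smaller than the minimum.
        yield bytes(buffer)
```

With three inputs of 6, 7 and 4 units and a part size of 5, this yields parts of 5, 5, 5 and 2 units, matching the example above. Each yielded part can then be written to a temporary file or uploaded directly as one numbered part.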

Hopefully, it will help someone.
Hey @Ankur, I am glad you figured out a workaround. It seems like a very smart approach. Can you post this as an answer as well, so that it's easier for other readers to understand? Thanks a lot!
