Is Amazon AWS/EBS snapshotting just LVM, or what?

Thu Sep 28 14:10:07 EDT 2017

On 09/28/2017 01:32 PM, Ken D'Ambrosio wrote:
> I would say it's unlikely to be LVM, because LVM is content-ignorant; it snapshots the entire volume, which is
> inefficient, and when you're Amazon, you care a LOT about being efficient.  Instead, I imagine they're using some
> content-aware CoW solution such as ZFS.  But, whatever mechanism, I agree with your opinion: I doubt that their solution
> -- almost certainly CoW of some sort -- would have more than a slight impact on performance.

Oh--yeah, ZFS is another good candidate. Actually, there are a few others that I can think of as well....

But it is basically `screaming "COW"' at me, and my gut is telling me that this `fear of over-snapshotting'
is (generally) the same thing as when people say they `need to do multithreading [for EVERYTHING]
because it's so expensive to fork a new process'. There are some corner cases where fork() really is
`too expensive' in at least some sense (and I've actually run into some of those cases),
but *most* of those claims seem to come from people who don't even know that COW is a thing...
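
To make the COW intuition concrete, here is a toy sketch in Python. It is only an illustration
under my own assumptions: the class and method names are made up, and it says nothing about how
Amazon actually implements EBS snapshots. It is loosely modeled on classic (non-thin) LVM
snapshots, where each snapshot keeps its own table of preserved blocks.

```python
class CowVolume:
    """Toy block device with copy-on-write snapshots (hypothetical, illustration only)."""

    def __init__(self, num_blocks):
        self.blocks = [b"\0"] * num_blocks
        self.snapshots = []  # one dict per snapshot: block index -> original contents
        self.copies = 0      # number of blocks preserved so far

    def snapshot(self):
        # No data is copied here: we only open an empty table, so the cost
        # of creating a snapshot does not depend on the size of the volume.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, data):
        # The first write to a block after a snapshot pays for one extra copy
        # per snapshot that has not preserved that block yet.
        for saved in self.snapshots:
            if index not in saved:
                saved[index] = self.blocks[index]
                self.copies += 1
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        # Preserved copy if the block changed since the snapshot, else the live block.
        return self.snapshots[snap_id].get(index, self.blocks[index])


vol = CowVolume(1_000_000)
snap = vol.snapshot()              # cheap, regardless of volume size
vol.write(7, b"new")               # first write after the snapshot: one copy
vol.write(7, b"newer")             # same block again: no further copy
print(vol.copies)                  # 1
print(vol.read_snapshot(snap, 7))  # b'\x00'
```

In this model the only ongoing cost is the extra copy on the first write to each block, and that
cost grows with the number of live snapshots and the amount of changed data, not with the volume size.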

> $.02, YMMV and other assorted disclaimers,
> 
> -Ken
> 
> 
> On 2017-09-28 13:16, Joshua Judson Rosen wrote:
>> I'm working on a project that uses Amazon AWS-provided VPS instances,
>> and the other guy on the project is telling me that "snapshotting
>> hourly may degrade performance",
>> and I'm trying to determine whether that's actually true. My gut feeling
>> is that it sounds kind of bogus.
>>
>> From the information I've been able to find about how Amazon's stuff works (either in terms
>> of how it's _implemented_ [for which I'm finding basically no insight]
>> or how it's _characterized_
>> [in the engineering sense, not the literary sense]...), it really
>> sounds a _lot_ like Amazon
>> is just using LVM snapshots, e.g. from <https://aws.amazon.com/ebs/faqs/>:
>>
>>     "snapshots can be done in real time while the volume is attached and in use.
>>      However, snapshots only capture data that has been written to your Amazon EBS volume,
>>      which might exclude any data that has been locally cached by your application or OS."
>>
>>     "By design, an EBS Snapshot of an entire 16 TB volume should take no longer than the time
>>      it takes to snapshot an entire 1 TB volume. However, the actual time taken to create
>>      a snapshot depends on several factors including the amount of data that has changed
>>      since the last snapshot of the EBS volume."
>>
>> ... though I'm not entirely sure how to interpret that last bit about
>> "time taken to create a snapshot
>> depends on... the amount of data that has changed since the last snapshot";
>> the _first half of that statement_ reads as "creating a snapshot is
>> constant time",
>> which basically screams to me "copy-on-write just like LVM, and
>> they're probably implemented
>> in terms of LVM".
>>
>> Any insight here as to whether my gut is correct on this, or whether
>> I'm actually likely
>> to notice an impact from hourly snapshots of, say, a 200-GB volume?
>> How about a 1-TB volume?
>>
>> The only thing I'm seeing from Amazon that seems to _vaguely_ support
>> (maybe) the notion
>> that `snapshotting too often' would be something to worry about is
>> this bit from elsewhere
>> in that same FAQ page (under the heading of "performance", whereas the
>> others were
>> under the heading of "snapshots" and a subheading of "performance
>> consistency of my HDD-backed volumes"):
>>
>>     Another factor is taking a snapshot which will decrease expected write performance
>>     down to the baseline rate, until the snapshot completes.
>>
>> ... and, taken in the context of the previously-cited notes about
>> snapshots being
>> `not based on volume-size but maybe influenced by
>> changed-since-last-snapshot set size'
>> (and in the context of the explanations they give for HDD-backed vs.
>> SSD-backed storage),
>> I'm basically reading that as:
>>
>>     `if you're using HDD-backed storage then it's because you care about *throughput*
>>      more than *response time* and are likely to be monitoring throughput,
>>      and if you're monitoring throughput you may notice a *momentary dip in throughput*
>>      as the *HDDs* need to seek around to find the volume boundaries and set up the COW records.'
>>
>> Even if you don't have any insight into what's actually happening
>> under the covers at Amazon,
>> does my reading of all of this sound right to you?
>>
>> And, perhaps more interestingly, are these same caveats from Amazon
>> generally applicable to LVM?
> 

-- 
"Don't be afraid to ask (λf.((λx.xx) (λr.f(rr))))."