Filesystem overhead
Andrew W. Gaunt
quantum at lucent.com
Mon Aug 4 08:28:49 EDT 2003
Ben, et al.,

Your explanation has corrected some misconceptions I had regarding
journaling filesystems. Thanks. I think I've gleaned that the journal
is an "add before you subtract" kind of system, meaning you never put
at risk information you don't have a copy of squirreled away somewhere
else (just in case). Somehow, this reminds me of my workshop; only I
do more adding than subtracting ;-)

I did read up on the ext3 implementation a bit. It's basically ext2
with a journal file (/.journal). That is kind of neat, and it helps
minimize the potential (as you point out below) for new bugs, since
much of the code is reused. An ext3 filesystem can even be mounted as
ext2 if it is unmounted cleanly. Also, there are tools to add a
journal to an ext2 filesystem, essentially converting it to ext3. Not
to denigrate other journaling filesystems, but it would seem ext3 is a
nice way to go if you're already comfortable with ext2 and you want
journaling.
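
For anyone who wants to try the conversion, it's a one-liner with
tune2fs (the device name below is just an example; point it at your
own partition and update /etc/fstab from ext2 to ext3 afterward):

    # add a journal to an existing ext2 filesystem, making it ext3
    tune2fs -j /dev/hda1

    # a cleanly unmounted ext3 filesystem can still be mounted as ext2
    mount -t ext2 /dev/hda1 /mnt
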
--
____ __
| 0|___||. Andrew Gaunt *nix Sys. Admin., etc. Lucent Technologies
_| _| : : } quantum at lucent.com - http://www-cde.mv.lucent.com/~quantum
-(O)-==-o\ andrew_gaunt at hotmail.com - http://www.gaunt.org
bscott at ntisys.com wrote:
>On Wed, 30 Jul 2003, at 8:35am, quantum at lucent.com wrote:
>
>>Very cool, that was revealing. Perhaps this discussion can evolve into how
>>journalling (e.g. ext3, etc.) works and why it is good/bad. Anybody?
>>
>
> If a system crashes (software, hardware, power, whatever) in the middle of
>a write transaction, then it is likely that the filesystem will be left in an
>inconsistent state. For that reason, many OSes will run a consistency check
>on a filesystem that was not unmounted cleanly before mounting it again.
>Most everyone here has probably seen "fsck" run after a crash for this
>reason.
>
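You can actually see the flag this boot-time check keys off of, and
force a full check by hand; the device name is again just an example:

    # "clean" vs. "not clean" decides whether fsck runs automatically
    tune2fs -l /dev/hda1 | grep 'Filesystem state'

    # force a full check yourself (filesystem should be unmounted)
    e2fsck -f /dev/hda1
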
> That consistency check can take quite a long time, especially on a large
>filesystem. If the filesystem is sufficiently large, the check time can be
>hours. Worse still, if the crash happened at just the right (or wrong) time,
>it can cause logical filesystem damage (e.g., a corrupt directory), leading
>to additional data loss.
>
> To solve this problem, one can use a journaling filesystem. A
>journaling filesystem does not simply write changes to the disk. First, it
>writes the changes to a journal (sometimes called a "transaction log" or
>just "log"). Then it writes the actual changes to the disk (sometimes
>called "committing"). Finally, it updates the journal to note that the
>changes were successfully written (sometimes called "checkpointing").
>
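Here is that three-step sequence acted out in shell, just to make the
idea concrete. The journal format and file names are made up for the
sketch and have nothing to do with ext3's on-disk layout; assume
'newdata' and 'target' are existing files:

    # 1. journal: record the intended change first, and get it on disk
    echo "BEGIN txn1: replace target with newdata" >> journal.log
    sync

    # 2. commit: make the actual change
    cp newdata target
    sync

    # 3. checkpoint: note that the change landed safely
    echo "END txn1" >> journal.log
    sync
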
> Now, if the system crashes in the middle of a transaction, upon re-mount,
>the system just has to look at the journal. If a complete transaction is
>present in the journal, but has not been checkpointed, the journal is
>"played back" to ensure the filesystem is made consistent. If an incomplete
>transaction is present in the journal, it was never committed, and thus can
>be discarded.
>
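Continuing the toy sketch above: recovery only has to look at the tail
of the journal. Replaying is safe here because the copy is idempotent,
and a torn, half-written record would simply be discarded:

    # if the last record is a BEGIN with no matching END, the commit
    # may not have finished, so redo it and then checkpoint
    last=$(tail -n 1 journal.log)
    case "$last" in
      "BEGIN txn1"*) cp newdata target && sync && echo "END txn1" >> journal.log ;;
    esac
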
> Of course, none of this guarantees you won't lose data. If a program was
>in the middle of writing data to a file when the system crashed, chances
>are, that file is now scrambled. Journaling protects the filesystem itself
>from damage, and avoids the need for a consistency check after a crash.
>
> It is also important to understand the difference between journaling
>*all* writes to a filesystem, and journaling just *metadata* writes. The
>term "metadata" means "data about data". Things such as a file's name,
>size, time it was last modified, the specific blocks on disk used to store
>it, that sort of thing, is metadata. The metadata is critical, because
>corruption of a small amount of metadata can lead to the loss of large
>amounts of file data.
>
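A quick way to see the split: everything a long directory listing
prints is metadata; none of it is the file's contents:

    # inode number, permissions, owner, size, mtime, name: all metadata
    ls -li /etc/hosts
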
> Some journaling filesystems journal just metadata. This keeps the
>filesystem itself from becoming inconsistent in a crash, but may leave the
>file data itself corrupted. ReiserFS does this. Why journal just metadata?
>Because journaling everything can cause a big performance hit, and, as
>noted above, if the system crashed in the middle of a write, there is a good
>chance you've already lost data anyway.
>
> Other filesystems journal all writes, or at least give you the option to.
>EXT3 is one such filesystem. This can prevent file corruption in the case
>where an "atomic" write of the file data was buffered in memory and being
>written to disk when the crash occurred.
>
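With ext3 you pick the behavior at mount time; these are the standard
ext3 mount options (device and mount point are just examples):

    # journal file data as well as metadata (safest, slowest)
    mount -t ext3 -o data=journal /dev/hda1 /mnt

    # journal metadata only, but flush data blocks first (the default)
    mount -t ext3 -o data=ordered /dev/hda1 /mnt

    # journal metadata only, with no data/metadata ordering (fastest)
    mount -t ext3 -o data=writeback /dev/hda1 /mnt
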
> About the only real drawback to a journaling filesystem is the
>performance hit. You have to write everything to disk *twice*: Once to the
>journal, and once to the actual filesystem.
>
> There are other drawbacks: Journaling filesystems are more complex, so
>are statistically more likely to have bugs in the implementation. But a
>non-journaling filesystem can have bugs, too, so I think the best answer is
>just more thorough code review and more testing. The journal also uses some
>space on the disk. But as the space used by the journal is typically
>megabytes on a multi-gigabyte filesystem, the overhead is insignificant.
>
> Finally, a journaling filesystem does not eliminate the need for "fsck"
>and similar programs. Inconsistencies can be introduced into a filesystem
>in other ways (such as bugs in the filesystem code or hardware problems).
>Since, with a journaling filesystem, "fsck" will normally *never* be run
>automatically by the system, it becomes a good idea to run an fsck on a
>periodic basis, "just in case". EXT2/3 even has a feature that will cause
>the filesystem to be automatically checked every X days or every Y mounts.
>
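Those knobs live in tune2fs as well; the device name is just an example:

    # check after every 30 mounts or every 180 days, whichever comes first
    tune2fs -c 30 -i 180d /dev/hda1

    # see the current settings
    tune2fs -l /dev/hda1
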
> Hope this helps,