Mantis Bug Tracker

View Issue Details
ID: 0000438
Project: FlexRAID Transparent RAID
Category: [All Projects] General
View Status: public
Date Submitted: 2017-01-06 20:34
Last Update: 2017-02-09 11:50
Reporter: adridolf
Assigned To: Brahim
Priority: normal
Severity: minor
Reproducibility: always
Status: assigned
Resolution: unable to reproduce
Platform: Windows
OS: Windows Server 2012 R2
OS Version:
Summary: 0000438: Verify Sync fails at the end of the array
Description: I run a scheduled Verify Sync job that covers 150 GB per day. So far this has always been successful, except when the end of the array (i.e. the last bytes of the largest disks and of the parity) is reached. This "final" job fails, stating that 2 stripe blocks had been updated.

The positions given for the first and last updated byte (see Additional Information) are always the same.
Additional Information: I have one 4 TB disk as PPU, and two 4 TB disks, one 1 TB disk, and one 600 GB disk as DRUs. The three 4 TB drives have the same product number (HGST HDN724040ALE640, P/N 0F22408).
All disks are initialized with a GPT. Formatting is always ReFS.

FULL JOB OUTPUT:

Verify Sync RAID [ServerStorage] started at Wed Dec 14 15:32:59 GMT+100 2016

Name: Verify Sync RAID [ServerStorage]
Start Date: Wed Dec 14 15:32:59 GMT+100 2016
End Date: Wed Dec 14 15:38:35 GMT+100 2016
Duration: 00:05:36
Throughput: 317.224MB/s
Total Size: 104.089GB

2 stripe blocks successfully updated

First byte updated at 4000785960960

Last byte updated at 4000785981440

Verify Sync RAID [ServerStorage] ended at Wed Dec 14 15:38:35 GMT+100 2016

Tags: No tags attached.
tRAID Version: 1.0
Attached Files:
alltogether.flat.png (2,758,477 bytes) 2017-01-07 20:16
Verification.PNG (20,376 bytes) 2017-01-08 15:09
shutdownLog.log (530,443 bytes) 2017-01-08 15:10
CaptureEndofDisc1500GB.PNG (27,406 bytes) 2017-01-13 10:00
drivedetailswebui2.png (112,843 bytes) 2017-01-16 11:35

- Notes
(0001599)
Brahim (manager)
2017-01-07 02:58

Hi,

When you formatted the disks, how much space did you leave at the end of each DRU disk?

It is important to leave a few MBs at the end of each disk to account for dying sector reallocation and whatnot.

Not sure if this was documented for Windows, but it is documented in the "Formatting or verifying the formatting of an existing disk" section of the Linux guide: http://wiki.flexraid.com/2014/06/22/ultimate-linux-guide-to-transparent-raid/
(0001621)
adridolf (reporter)
2017-01-07 15:23

I used the Windows Disk Management GUI, which seems to leave somewhat less than 1 MB at the end of each disk (if I calculated correctly).

So am I supposed to correct that? (This would mean stopping the array, rebooting, changing the partition sizes of the individual disks with a partition manager, starting the array again, and then running a Verify Sync on everything?)

If I just stayed with the current setup, would the drawback be only a minor loss of parity data at the end, or would this corrupt the whole parity and jeopardize a successful recovery in case of a disk failure?

Thanks a lot for your help!


Alignment in detail:
I did not find a program which actually showed me the free space at the end, so I collected data from different programs (wmic partition get ... and AIDA64).
The partition size is 4 000 650 887 168 bytes for all formatted 4 TB disks, starting at an offset of 135 266 304 bytes (as the GPT area takes roughly 128 MiB). The disk size is 7 814 037 168 sectors times 512 bytes, which equals 4 000 787 030 016 bytes. So the free space at the end should be 876 544 bytes (856 KiB).
Did not calculate for the smaller disks ...
Sectors are 4 KiB physical, 512 bytes logical.
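
For clarity, the same calculation written out as a minimal Python sketch; the variable names are mine, and the figures are simply the ones quoted above:

# All figures in bytes, taken from the numbers above.
disk_size = 7_814_037_168 * 512        # 7 814 037 168 sectors x 512 bytes = 4 000 787 030 016
partition_start = 135_266_304          # partition offset (GPT area)
partition_size = 4_000_650_887_168     # size of the formatted partition

# Unused space left at the end of each 4 TB disk.
end_gap = disk_size - (partition_start + partition_size)
print(end_gap)          # 876544
print(end_gap / 1024)   # 856.0 (KiB)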
(0001627)
Brahim (manager)
2017-01-07 19:56

Your disk offsets are fine given the values provided.

Are the reported values truly the same, day after day?
First byte updated at 4000785960960
Last byte updated at 4000785981440

I understand that it happens toward the end of the disk. The exact values for each run do matter, however.
So, look in your task history and post the values for each run. If you cleared your history, capture the values for the next few days and report back.
(0001629)
adridolf (reporter)
2017-01-07 20:27
edited on: 2017-01-07 20:29

As this might not have been made clear: I use the scheduled range operation with 150 GB per day. With 4 TB of parity, this means about 25 days to go through the whole array (which is intentional, as I do not want to wear out my consumer disks just by checking for parity corruption). As I'm relatively new to this product, I have only reached the end of the array twice (in addition to the initial Verify/Sync that built the array, obviously); at the moment, I'm between 3.222 and 3.369 TB, so the end is close again.

However, both times the end was reached, the values were exactly as stated earlier (you can also have a look at the image I assembled, provided as an attachment).

From the web UI's log of the scan at the end of the array:
INFO executeAction(363) - Processing range from 3.613TB to 3.638TB...

(0001632)
Brahim (manager)
2017-01-07 21:07

Thanks for the detailed info.

What you could do is run a Range Verify & Sync (Advanced Operations -> Specific Range Operations) at different intervals. Set the range to be the last 1GB.

Let's collect about 5 runs over the next few days and see what's reported.
(0001646)
adridolf (reporter)
2017-01-08 15:09

Thanks for the hint on the specific range operations, I did not know this before.

After testing, I'm convinced the issue is triggered by a shutdown of the server
(I added a log of my tests to the attachments):
Except for the very first test, every run was successful until I restarted the machine this morning. After that, I was able to reproduce the issue by doing a shutdown each time. Note that a simple array restart (the whole array, not just the pool) is not sufficient to cause the effect.

In each case, the "repaired" range was the same as before.

Notes on my system:
Both the service and the web server run on the same physical machine (a Hyper-V host), so they are not virtualized. There are virtual machines on that host, but they should not interfere ...
(0001647)
adridolf (reporter)
2017-01-08 15:12

Oh, and I added the log (service only) covering the shutdown (the service start is at 15:52). I changed logging to trace BEFORE shutting down.
(0001651)
Brahim (manager)
2017-01-08 16:46

I'm starting to wonder if you might have a disk with a bad sector of some sort.

Things validating fine until a shutdown could be because the correct data is still in the OS's cache; with a bad sector, the sector itself might not actually be storing the data.

So far, this is the only such report, and virtually all users schedule range verification of their data.
Do you have another array, or a set of disks you can configure into another array, to see if the issue persists there?
(0001652)
adridolf (reporter)
2017-01-08 16:58
edited on: 2017-01-08 16:59

Unfortunately, I do not have an easy way to set up another array. I could try to do this with two external drives via eSATA, but those do not have the same sizes (one 1 TB and one 1.5 TB). If the mercurial issue was resolved, I could try to combine an internal 1.5 TB drive and the other one via eSATA.

I could also do tests on my desktop machine, but there I only have one 2 TB disk and a dynamic spanned volume built out of two 1 TB drives. Technically, I could erase the 1 TB drives and test with them, but this would be a different machine (and I do not have a second license for that).

Note that I do NOT use the OS caching feature (I don't know how much the OS's cache is involved anyway).

I could, however, scan for bad sectors using chkdsk (if that is possible for ReFS) or HD Tune Pro.

(0001655)
adridolf (reporter)
2017-01-08 17:42

I did an "error scan" of the last GB of all 4 TB disks with HD Tune Pro and found nothing. Also note that my 4 TB drives (data and parity) are all relatively new (manufacturing dates 05/16, 10/15 and 06/16). In contrast, the smaller 1 TB and 600 GB drives (data) are very old (> 5 years), but that should not matter in this context, should it?
(0001656)
Brahim (manager)
2017-01-08 17:47

1. You can use a trial license for testing on other systems. You can even self-extend your trials through the consumer portal.

2. The configurable pool OS Caching is a file level cache and has no bearing on the disk/driver level cache. The disk/driver level cache is what I was referring to.

3. We have to establish that the issue is in the software and not a system-specific issue. So, having another setup that exhibits the issue is key to resolving it.
I tested on several systems without being able to replicate the reported issue.

So, update whenever you are able to reproduce the behavior on another setup.
(0001662)
adridolf (reporter)
2017-01-09 14:08

Just a question about testing:
For the sake of speed, and to spare my disks, is it possible to skip the full initial Verify Sync and proceed as follows:
1. Use a formatted DRU (with data) and a PPU without partition (but not technically erased) to set up an array by using the "Do Nothing" approach
2. Instead of a full Verify Sync for initialization, just do a specific range Verify Sync of the last, say, 5 GB
3. Do repeated Verify Sync operations on e.g. the last 1 GB, i.e. within the already-synchronized area, to test whether the issue occurs

Would this partial synchronization be sufficient for testing? (I understand that this gives me no protection in case of a drive failure, but a test environment could be set up in a few minutes this way.)
(0001664)
Brahim (manager)
2017-01-09 14:34

Yes, that would work.
Parity does not need to be computed or correct for the entire array... only for the range that will be test verified.
(0001690)
adridolf (reporter)
2017-01-13 10:06

So far, I have been able to reproduce the issue with another newly set up storage pool ON THE SAME MACHINE:

DRUs:
1x 1.5 TB disk (GPT, NTFS, with data)
1x 20 GB vhdx virtual disk (GPT, NTFS, since I cannot create a tRAID array with a single DRU)

PPU:
1x 1.5 TB disk (GPT)

The error is like before (with matching bytes, obviously; see the screenshot).

If I find some time again, I will try to reproduce it on a different system (where I should also be able to switch between MBR and GPT).
(0001692)
adridolf (reporter)
2017-01-14 14:37
edited on: 2017-01-14 14:39

During further tests, still on the same machine, I gained additional insights (using the 1.5 TB setup from the note above):

1. I shrank the data volume on the 1.5 TB DRU by 2 GB, yielding a little more than 2 GB of free space at the end. -> No change in behavior

2. Inspired by your suspicion about caching, I tried manually stopping the array through the Web UI prior to a shutdown (note that just stopping and starting the array does not cause the issue), waiting half a minute, and then shutting down. -> Still no change in behavior

3. I reinitialized the PPU disk to MBR (which, as a side effect, gave it about 128 MB more free space previously occupied by the GPT partition). THIS DID SOLVE THE ISSUE for the 1.5 TB array. I can reproducibly switch the issue on and off just by changing the MBR/GPT initialization of the PPU.
Unfortunately, this is no remedy for my main storage pool, since that one contains disks larger than 2 TB. But maybe this provides a hint for you to narrow down the problem.

In case you haven't noticed already: the updated area is exactly 20480 bytes for both pools, and the first and last byte even end in the same four digits. Also note that while this results in TWO updated stripe blocks for the 4 TB array, it results in FIVE updated stripe blocks for the 1.5 TB array, although the number of bytes is the same (both visible in the attached screenshots).
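
For reference, here is the arithmetic behind that observation as a small Python sketch. The per-block sizes in the last two lines are only my inference from dividing the span by the reported block counts; the job output does not state them directly.

# Offsets reported for the updated range on the 4 TB array (bytes).
first_byte = 4_000_785_960_960
last_byte = 4_000_785_981_440

span = last_byte - first_byte
print(span)   # 20480, the same span size seen on the 1.5 TB pool

# Implied per-block size if the span is divided by the reported counts
# (2 blocks on the 4 TB array, 5 blocks on the 1.5 TB array).
# This is an inference, not something the job output reports.
print(span / 2)   # 10240.0
print(span / 5)   # 4096.0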

(0001693)
Brahim (manager)
2017-01-14 21:06

If your PPU initialized to GPT is giving you issues, then change it to MBR.
MBR vs GPT has no impact on the PPU, regardless of the disk's size. MBR vs GPT only matters for DRUs.

I would normally have speculated that your disk driver is updating the GPT record on the PPU, but there are two copies of that record: one at the start and one at the end of the disk. You would expect both to be updated, not just the one at the end. So really, I have no idea why this is happening on your system. I am not able to replicate it on my systems, and other users don't seem to have the issue.

So: stop the array, bring the PPU online, re-initialize it to MBR, and re-create the parity on the array.
(0001694)
Brahim (manager)
2017-01-14 21:13

FYI, this issue can be re-opened if it can be replicated on a different system.
(0001695)
Brahim (manager)
2017-01-15 18:49

Re-opening this bug, since I have finally been able to replicate it myself.
(0001696)
adridolf (reporter)
2017-01-15 21:39

Today, I have finally changed my PPU to MBR. ;-)

It was quite interesting that the Verify Sync afterwards only changed a small number of stripe blocks, namely the usual ones at the end and the following ones at the beginning:

4 stripe blocks successfully updated

First byte updated at 136134656

Last byte updated at 4203319296

Is it possible that FlexRAID is simply not aware of the GPT initialization and just overwrites parts of the GPT partition...?
(0001697)
Brahim (manager)
2017-01-16 03:19

So, on the system where I was able to replicate the issue, it turned out that the range being updated was beyond the end of the disk.

That is, the disks were 3000590369280 bytes in size, whereas the first block updated is at 3000591912960 (or 1543680 bytes past the end of the disks).
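
Spelled out, the check is a single subtraction; a minimal Python sketch using the sizes above:

# Sizes from the system where the issue was replicated (bytes).
disk_size = 3_000_590_369_280
first_block_updated = 3_000_591_912_960

# How far past the end of the disk the updated block falls.
print(first_block_updated - disk_size)   # 1543680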

In the Web UI, could you please pull up the disk details for the registered disks in that array and share the reported disk sizes?
(0001700)
adridolf (reporter)
2017-01-16 11:42
edited on: 2017-01-16 11:42

I added a screenshot of the size as requested (drivedetailswebui2.PNG). Please remove the earlier version (drivedetailswebui.PNG), since I forgot to remove my drive's serial number.

The drive data is the same for all 4 TB drives (including firmware version), both for the DRUs (GPT) and the PPU (MBR). Note that the number of sectors deviates from the data I collected with AIDA64 earlier (altogether.flat.png, lower right).

Note that the array also contains a 1 TB and a 600 GB disk, as discussed earlier. If data for those is required, please say so.

Also note that for a 2-way mirror I set up yesterday with the 1.5 TB disks used during the earlier tests, I did not observe similar issues, although the second DRU was initialized as GPT before the first Verify/Sync (however, in this case I would not have expected the same problem to occur).

(0001702)
Brahim (manager)
2017-01-16 17:26

The behavior is the same: the first failure in your case is also 1543680 bytes past the end of the disk, just like on my test system.
The Verify & Sync task is processing data past the end of the disk.

It is strange that MBR vs GPT would make a difference here.

Will look deeper into it.
(0001727)
adridolf (reporter)
2017-02-09 11:50

Just for your information: since I changed the PPU to MBR, syncing is completely fine. Maybe you can make this a recommended approach somewhere in the wiki (e.g. in a guide for setting up the array).

Note that when reinitializing the PPU from GPT to MBR, only the beginning and the end of the disk go out of sync (as one would expect), so no complete Verify/Sync should be necessary when switching. However, the affected area at the beginning was not the GPT area itself, but roughly the first 4 GB after it.

Affected bytes in my case:

Beginning:
4 stripe blocks successfully updated

First byte updated at 136134656

Last byte updated at 4203319296

End:
2 stripe blocks successfully updated

First byte updated at 4000785960960

Last byte updated at 4000785981440
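
For completeness, a quick Python check of those two ranges (plain subtraction on the values above; the GiB conversion is mine):

# Out-of-sync ranges reported after switching the PPU from GPT to MBR (bytes).
begin_first, begin_last = 136_134_656, 4_203_319_296
end_first, end_last = 4_000_785_960_960, 4_000_785_981_440

# Beginning: roughly the first 4 GB of the data area.
print(begin_last - begin_first)             # 4067184640
print((begin_last - begin_first) / 2**30)   # ~3.79 GiB, consistent with "the first 4 GB"

# End: the same 20480-byte window as before.
print(end_last - end_first)                 # 20480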

- Issue History
Date Modified Username Field Change
2017-01-06 20:34 adridolf New Issue
2017-01-07 02:58 Brahim Note Added: 0001599
2017-01-07 02:58 Brahim Assigned To => Brahim
2017-01-07 02:58 Brahim Status new => feedback
2017-01-07 15:23 adridolf Note Added: 0001621
2017-01-07 15:23 adridolf Status feedback => assigned
2017-01-07 19:56 Brahim Note Added: 0001627
2017-01-07 19:56 Brahim Status assigned => feedback
2017-01-07 20:16 adridolf File Added: alltogether.flat.png
2017-01-07 20:27 adridolf Note Added: 0001629
2017-01-07 20:27 adridolf Status feedback => assigned
2017-01-07 20:29 adridolf Note Edited: 0001629 View Revisions
2017-01-07 21:07 Brahim Note Added: 0001632
2017-01-07 21:07 Brahim Status assigned => feedback
2017-01-08 15:09 adridolf Note Added: 0001646
2017-01-08 15:09 adridolf Status feedback => assigned
2017-01-08 15:09 adridolf File Added: Verification.PNG
2017-01-08 15:10 adridolf File Added: shutdownLog.log
2017-01-08 15:12 adridolf Note Added: 0001647
2017-01-08 16:46 Brahim Note Added: 0001651
2017-01-08 16:46 Brahim Status assigned => feedback
2017-01-08 16:58 adridolf Note Added: 0001652
2017-01-08 16:58 adridolf Status feedback => assigned
2017-01-08 16:59 adridolf Note Edited: 0001652 View Revisions
2017-01-08 16:59 adridolf Note Edited: 0001652 View Revisions
2017-01-08 17:42 adridolf Note Added: 0001655
2017-01-08 17:47 Brahim Note Added: 0001656
2017-01-08 17:47 Brahim Status assigned => feedback
2017-01-09 14:08 adridolf Note Added: 0001662
2017-01-09 14:08 adridolf Status feedback => assigned
2017-01-09 14:34 Brahim Note Added: 0001664
2017-01-09 14:34 Brahim Status assigned => feedback
2017-01-13 10:00 adridolf File Added: CaptureEndofDisc1500GB.PNG
2017-01-13 10:06 adridolf Note Added: 0001690
2017-01-13 10:06 adridolf Status feedback => assigned
2017-01-14 14:37 adridolf Note Added: 0001692
2017-01-14 14:39 adridolf Note Edited: 0001692 View Revisions
2017-01-14 21:06 Brahim Note Added: 0001693
2017-01-14 21:06 Brahim Status assigned => closed
2017-01-14 21:06 Brahim Resolution open => unable to reproduce
2017-01-14 21:13 Brahim Note Added: 0001694
2017-01-15 18:49 Brahim Note Added: 0001695
2017-01-15 18:49 Brahim Status closed => confirmed
2017-01-15 21:39 adridolf Note Added: 0001696
2017-01-16 03:19 Brahim Note Added: 0001697
2017-01-16 03:19 Brahim Status confirmed => feedback
2017-01-16 11:33 adridolf File Added: drivedetailswebui.PNG
2017-01-16 11:35 adridolf File Added: drivedetailswebui2.png
2017-01-16 11:42 adridolf Note Added: 0001700
2017-01-16 11:42 adridolf Status feedback => assigned
2017-01-16 11:42 adridolf Note Edited: 0001700 View Revisions
2017-01-16 16:11 Brahim File Deleted: drivedetailswebui.PNG
2017-01-16 17:26 Brahim Note Added: 0001702
2017-02-09 11:50 adridolf Note Added: 0001727

