Fix/user.status fattr missing
Description
If there is a problem with the migration it may fail to set the `user.status` file attribute. This means that (1) we can't check that the files have been migrated to know if calibration should start, and (2) the webservice `wait_on_transfer` loop continues for almost an hour, even though the file attribute will likely not be present until migration is manually triggered again.
This adds a separate check for the attribute being present, and exits the `wait_on_transfer` loop after 5 minutes if it is not.
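For illustration only, here is a minimal sketch of what such a presence check and bounded wait could look like. It is not the code in this branch: the function names, the `list_attrs` and `sleep` parameters (added so the logic can be exercised without a real filesystem), and the choice of 5 attempts spaced 60 seconds apart are all assumptions.

```python
import logging
import os
import time

STATUS_ATTR = "user.status"  # attribute name taken from the description


def fattr_present(path, attr=STATUS_ATTR, list_attrs=None):
    """Return True if `attr` is among the extended attributes of `path`."""
    if list_attrs is None:
        list_attrs = os.listxattr  # only available on Linux
    return attr in list_attrs(path)


def wait_for_fattr(path, max_tries=5, delay=60.0, list_attrs=None,
                   sleep=time.sleep):
    """Poll for the attribute, giving up after `max_tries` attempts.

    With the assumed defaults (5 tries, 60 s apart) this gives up after
    about five minutes instead of staying in a much longer wait loop.
    """
    for attempt in range(1, max_tries + 1):
        if fattr_present(path, list_attrs=list_attrs):
            return True
        logging.warning(
            "`status` attribute missing, migration may have failed, "
            "on attempt %d/%d", attempt, max_tries,
        )
        sleep(delay)
    logging.critical(
        "`status` attribute missing after max tries; "
        "try triggering migration manually again."
    )
    return False
```

Keeping the presence check separate from reading the attribute's value lets the caller distinguish "attribute never set" from "migration still running".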
How Has This Been Tested?
Good question! It hasn't. I checked the logic of the check implemented in `_check_fattr_present` manually, but haven't tested this branch directly.
We could deploy it to the test server and request ITDM remove the flag on a run, then try triggering the calibration via MyMDC to see what happens?
Relevant Documents (optional)
Some DOC tickets
Types of changes
- Bug fix (non-breaking change which fixes an issue)
Checklist:
- My code follows the code style of this project.
- I added tests where appropriate.
Reviewers
Activity
Most importantly, let's involve @jmalka in the discussion, because he already said he would set a different value of the attribute if migration fails (https://in.xfel.eu/redmine/issues/102689). If that's happening in the near future anyway, it gives us a better alternative, and this is unnecessary.
- Resolved by Thomas Kluyver
@jmalka have you picked a value to set if the migration fails? Also, which repository has the code managing the transfers in it? I wanted to look through it and maybe do a PR, but couldn't find it.
added 1 commit
- d3c30619 - Use `os....xattr` instead of process calls, simplify,
Attribute `user.status` values:
- `migration_in_progress` - for the time from a run folder creation till the migration is finished
- once the migration is finished: `offline` if successful, else `notmigrated2d2`
- `migration_in_progress` and `notmigrated2d2` are still WIP
- `dCache` - if all files from a given run are copied to the dCache disk pool
- `tape` - if all files from a given run are written to tape
How about `online`? The (3 year old) code checked: `if retcode == 0 and 'status="online"' not in stdout.decode().lower()`. Is `online` no longer used?
edit: Spoke to Janusz, `online` is not used and the check has probably not been useful for a long time.
- `migration_in_progress` is set once the folder is created
- `notmigrated2d2` - if the sizes of files between online/offline do not match up, migration is triggered again; if it still does not match then it is set to not migrated
Edited by Robert Rosca
Does that mean it's still trying to migrate (including automatic retries)? Or does that mean that any retries have failed and it has given up?
From what I understood it does:
- `migration_in_progress` - migration starts
- Migration 'finishes'
- Check file sizes between online/offline clusters:
  - If sizes match it is successful, set to `offline`, or `dCache`/`tape`
  - If sizes do not match it is not successful, set to `notmigrated2d2`, migration gets triggered again for some number of retries, once the retry limit gets reached the attribute remains set to `notmigrated2d2`
So it means that there was an issue and that it may still be retrying or that it has given up, as far as I understood. Is that right @jmalka ?
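As a rough sketch of how a consumer such as the webservice might interpret these values (this is not the actual implementation): the `MigrationState` enum, the helper names, and the case-insensitive comparison are assumptions made for illustration. Only the attribute values themselves come from the discussion above.

```python
import os
from enum import Enum


class MigrationState(Enum):
    DONE = "done"                              # safe to start calibration
    PENDING = "pending"                        # migration still running
    FAILED_OR_RETRYING = "failed_or_retrying"  # size mismatch seen
    MISSING = "missing"                        # attribute not set at all
    UNKNOWN = "unknown"                        # unrecognised value


# Values that, per the discussion, indicate the files were migrated.
MIGRATED_VALUES = {"offline", "dcache", "tape"}


def classify_status(value):
    """Map a raw `user.status` value (or None) to a MigrationState."""
    if value is None:
        return MigrationState.MISSING
    normalised = value.strip().lower()
    if normalised in MIGRATED_VALUES:
        return MigrationState.DONE
    if normalised == "migration_in_progress":
        return MigrationState.PENDING
    if normalised == "notmigrated2d2":
        # May still be retrying, or may have given up; the attribute
        # alone does not say which.
        return MigrationState.FAILED_OR_RETRYING
    return MigrationState.UNKNOWN


def read_status(path, attr="user.status"):
    """Return the attribute value as text, or None if it cannot be read."""
    try:
        return os.getxattr(path, attr).decode()  # Linux only
    except OSError:
        return None
```

Because `notmigrated2d2` covers both "retrying" and "gave up", a caller would probably still want a time limit rather than waiting on that value indefinitely.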
Sweet, thanks for the clarification!
CI fails since it has a test which mocks process calls which no longer work now that it uses `os.listxattr`/`os.getxattr`. Conveniently this tests the new behaviour:
------------------------------ Captured log call -------------------------------
WARNING root:webservice.py:563 `status` attribute missing, migration may have failed, on attempt 1/5
WARNING root:webservice.py:563 `status` attribute missing, migration may have failed, on attempt 2/5
WARNING root:webservice.py:563 `status` attribute missing, migration may have failed, on attempt 3/5
WARNING root:webservice.py:563 `status` attribute missing, migration may have failed, on attempt 4/5
WARNING root:webservice.py:563 `status` attribute missing, migration may have failed, on attempt 5/5
CRITICAL root:webservice.py:556 `status` attribute missing after max tries for migration reached. Migration may have failed, try triggering migration manually again.
Updated the tests now for the new xattr checks
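For readers unfamiliar with the approach, a test that patches the xattr call rather than a subprocess might look roughly like this. It is a hedged sketch, not the test added in this branch: `status_attr_present` is a hypothetical stand-in for the function under test, and the run paths are made up.

```python
import os
from unittest import mock


def status_attr_present(path):
    """Hypothetical stand-in for the check under test."""
    return "user.status" in os.listxattr(path)


def test_attr_missing():
    # create=True lets the patch work even where os.listxattr is absent
    # (it only exists on Linux).
    with mock.patch("os.listxattr", create=True, return_value=[]) as fake:
        assert not status_attr_present("/hypothetical/run/dir")
    fake.assert_called_once_with("/hypothetical/run/dir")


def test_attr_present():
    attrs = ["user.status", "user.other"]
    with mock.patch("os.listxattr", create=True, return_value=attrs):
        assert status_attr_present("/hypothetical/run/dir")
```

Patching `os.listxattr` directly keeps the test independent of whether the filesystem running CI supports user extended attributes.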
- Resolved by Thomas Kluyver