Skip to content

Conversation

@bbockelm
Copy link
Collaborator

This PR adds a new OSS implementation to XRootD for gathering statistics about the underlying OSS performance.

It keeps counters about the number of operations performed and the overall time spent performing these operations then sends them out to the g-stream. It keeps a separate set of counters for "slow" operations (configurable), providing insight to administrators about the user experience beyond the averages. The intent is this information can easily be passed to time-series monitoring system such as Prometheus.

This adds two configuration options:

fsstats.trace [all|debug|info|warning|error]

fsstats.trace manages the verbosity of the logging done by the OSS statistics library. When debug
information is enabled, it will print out the performance counters once per second.

If fsstats.trace is not specified, then the default is fsstats.trace warning.

and

fsstats.slowops [duration]

The fsstats.slowops directive determines the threshold where a given storage operation is
considered "slow". Slow operations are kept as a separate set of monitoring counters, allowing
the administrator to see how often a storage system has poor performance.

The duration must be specified as a number with a unit such as 1.5s or 1s50ms; valid units
include h (hours), m (minutes), s (seconds), ms (milliseconds), us (microseconds), or
ns (nanoseconds).

If fsstats.slowops is not specified, then the default is fsstats.slowops 1s.

While the intent is the implementation is lightweight statistics gathering, everything has a cost; hence, this is kept as a loadable plugin instead of in the XRootD core.

The initial use case is to understand the performance of the storage subsystems in XCache. Currently our only performance monitoring is from looking at the throttle plugin. The throttle plugin includes both the storage performance for cache hits and the proxy performance for cache misses; I want to exclude the latter.

@amadio amadio requested a review from abh3 May 31, 2024 14:03
@amadio
Copy link
Member

amadio commented May 31, 2024

I've requested a review from @abh3, but first the CI needs to be fully green. You can look at the jobs matching this pull request number in CDash for a quicker way to dig through the error logs of the builds. Cheers,

@abh3
Copy link
Member

abh3 commented May 31, 2024

Sure, I'll review it but it will take me a bit longer than normal as I will be busy with an ATLAS meeting in Oslo.

@amadio
Copy link
Member

amadio commented Jun 1, 2024

Sure, I'll review it but it will take me a bit longer than normal as I will be busy with an ATLAS meeting in Oslo.

This needs a major release, I believe, so no need to rush. Cheers,

Copy link
Member

@abh3 abh3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am confused as to what this is. A real storage system or a wrapper to a storage system. My inclination is that this is a wrapper and, if so, i9t does not follow the conventions for wrapping an oss plugin. First, it should use the class defined in XrdOssWrapper.hh because doing so alleviates us from having to chase down yet another plugin API when we make future changes in the Oss API. Second, as a wrapper it should not have a GetFileSystem entry point as it can only be added using the "++" convention. Thirdly, a wrapper should not load or otherwise initialize the file system it is wrapping. That is done by the plugin system. So, that code is not appropriate here. So, could this be rewritten to follow the standard way Oss plugins wrappers should be written?

@bbockelm
Copy link
Collaborator Author

First, it should use the class defined in XrdOssWrapper.hh because doing so alleviates us from having to chase down yet another plugin API when we make future changes in the Oss API

Are there any examples of using these classes? I looked at the wrapper and it appears simple enough but it'd be good to know if I'm the only user of the class.

Second, as a wrapper it should not have a GetFileSystem entry point as it can only be added using the "++" convention. Thirdly, a wrapper should not load or otherwise initialize the file system it is wrapping.

Sadly, it's not possible to rely on the ++ syntax as you suggest due to #1730. Unfortunately, that issue means a wrapper writer needs to implement exactly what you say the wrapper should not do. If you are able tackle issue #1730, then I'd happily drop the other entry points (on this and other plugins).

@abh3
Copy link
Member

abh3 commented Jun 17, 2024 via email

@abh3
Copy link
Member

abh3 commented Jun 17, 2024 via email

@amadio amadio force-pushed the devel branch 4 times, most recently from 472b0ee to e358ac0 Compare July 3, 2024 12:40
@bbockelm
Copy link
Collaborator Author

@abh3 - updated to rebase on the XrdOssWrapper and removed the support for running as part of the cmsd.

Is the XrdOssWrapper part of the public API/ABI yet? The design is awkward to utilize because it does not own the wrapDF object. Hence, I have to keep two copies of the pointer, once in StatsFile (to own the lifetime) and once in the XrdOssWrapDF (a reference).

If the API is not frozen, it'd be better to change the wrapper to own the object to manage the lifetime.

@xrootd-dev
Copy link

xrootd-dev commented Jul 24, 2024 via email

@bbockelm
Copy link
Collaborator Author

@abh3 - just a bump on this PR (I see I didn't press the "re-review" button; I'll do that now). If it wasn't obvious from the history, I did the refactoring you requested back in July. My follow-up questions were simply around the API design.

@bbockelm bbockelm requested a review from abh3 August 26, 2024 17:54
@abh3
Copy link
Member

abh3 commented Aug 26, 2024 via email

Copy link
Member

@abh3 abh3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's resolve the comments.

@amadio amadio force-pushed the devel branch 2 times, most recently from 48d417e to 9933e04 Compare September 4, 2024 09:56
@bbockelm
Copy link
Collaborator Author

@abh3 - any news on this front?

I left the one remaining discussion thread open because I wanted to make sure you understood how the visibility changes work.

@amadio
Copy link
Member

amadio commented Jan 27, 2025

This needs to wait for a feature release (i.e. at least 5.8), while the next one will be a patch release (i.e. 5.7.3). Other than that, once @abh3 approves, we can merge to devel.

@bbockelm
Copy link
Collaborator Author

Yes, no problems waiting for a feature release! Just trying to tidy up pending PRs on a Monday morning.

Copy link
Member

@abh3 abh3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bbockelm Almost done here. Only one question remains (no code changes needed). We don'y like putting things in the repo that a) are not explainable, b) look like they are not needed. So, could you address the question posed here.

@@ -0,0 +1,7 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this is needed? By default, the way XrdOssAddStorageSystem2 has been declared automatically makes it visible. Any other symbols are visible when made extern, making everything visible is generally not good practice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be reading this backward. This makes all symbols local to the library (not appearing in the symbol table) except for XrdOssAddStorageSystem2*. If I didn't have the exception listed in the global section, then it would hide XrdOssAddStorageSystem2* as well -- not what we want!

Any other symbols are visible when made extern, making everything visible is generally not good practice.

That's not how C++ works. The extern keyword specifies the storage class for the variable or the language linkage property for functions. It does not affect symbol visibility in the resulting module.

For example, consider the SciTokens library (which doesn't manage symbol visibility:

$ nm -D lib64/libXrdAccSciTokens-5.so | c++filt | grep -v ' U ' | wc -l
176

Quite a bit of "stuff" -- all not needed to be visible from the outside because, really, all that's needed is XrdAccAuthorizeObjAdd and friends. Loading it pollutes up the global symbol table.

This is not because extern is or is not used -- but because C++ exports all the namespace-level symbols globally by default when dynamic linking.

Compare this to libXrdMacaroons where we do manage the symbol visibility:

$ nm -D lib64/libXrdMacaroons-5.so | c++filt | grep -v ' U ' | wc -l
14

The list is short enough I can paste it into the comment -- and it looks reasonable:

$ nm -D build/release_dir/lib64/libXrdMacaroons-5.so | c++filt | grep -v ' U ' 
                 w __cxa_finalize
000000000000e678 t _fini
                 w __gmon_start__
0000000000004000 t _init
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 w __pthread_key_create
0000000000014618 B SciTokensHelper
0000000000005be0 T XrdAccAuthorizeObjAdd
0000000000014540 D XrdAccAuthorizeObjAdd_
0000000000005c50 T XrdAccAuthorizeObject
00000000000145a0 D XrdAccAuthorizeObject_
0000000000005f20 T XrdHttpGetExtHandler
00000000000144e0 D XrdHttpGetExtHandler_

You only have things that are desired to be exported -- and basically required for a shared library (_init, _fini).

Most of the Xrootd modules export way more than intended: based on your review comment, I wanted to tidy up this one.

Copy link
Member

@abh3 abh3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, I read it backwards. Thanks for the explanation.

static_cast<float>(m_slow_times.m_open)/1e9, static_cast<float>(m_slow_times.m_read)/1e9, static_cast<float>(m_slow_times.m_readv)/1e9,
static_cast<float>(m_slow_times.m_pgread)/1e9, static_cast<float>(m_slow_times.m_write)/1e9, static_cast<float>(m_slow_times.m_pgwrite)/1e9,
static_cast<float>(m_slow_times.m_dirlist)/1e9, static_cast<float>(m_slow_times.m_stat)/1e9, static_cast<float>(m_slow_times.m_truncate)/1e9,
static_cast<float>(m_slow_times.m_unlink)/1e9, static_cast<float>(m_slow_times.m_rename)/1e9, static_cast<float>(m_slow_times.m_chmod)/1e9
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do all these divisions when you could multiply with 1e-9f?

@amadio
Copy link
Member

amadio commented Feb 7, 2025

This looks like it's ready for merging. However, if you don't want me to squash commits, please merge fixes into the appropriate commits they are fixing (i.e. the commit with "Address review comments" should just be a fixup of the previous commit, most likely, same for using RAtomic, the code should be added like that in one go).

@amadio amadio force-pushed the devel branch 2 times, most recently from e9dc92a to c958645 Compare February 14, 2025 12:36
With this, any OSS plugin may use the gstream monitoring.
The storage statistics OSS keep track of number of server-wide
requests and performance, reporting this information through the
g-stream.

It includes separate counters for "slow" requests, allowing one to
get a feel for how many requests go through the server over a
configurable threshold (default: 1s).
After refactoring to use the common wrapper base clase, multiple methods
in the file, directory, and filesystem for OSS stats were now identical
to the default implementation from the wrapper.  Remove them to reduce
redundant code.
In particular:
- Touchup class naming scheme to better match the rest of XRootD
- Avoid throwing exceptions on failure; add check to see if the resulting
  object is valid and either fail or bypass it.
- Hide symbol visibility outside the module.
Particularly,
- Use `RAtomic` classes where possible.
- Cleanup pointer on failure.
- Move internal functions into the `XrdOssStats::detail` namespace
@amadio
Copy link
Member

amadio commented Feb 18, 2025

Updated this to the current devel branch. Will merge it once the CI is green.

@amadio amadio merged commit e59d8e9 into xrootd:devel Feb 19, 2025
11 checks passed
@amadio amadio added this to the 5.8.0 milestone Feb 19, 2025
@amadio
Copy link
Member

amadio commented Feb 24, 2025

DEB/RPM generation is not working after merging this into devel. In order to move this to master to release this with 5.8, this needs to be addressed.

@amadio
Copy link
Member

amadio commented Feb 24, 2025

Please see the errors for RPM and DEB package generation at the respective links.

@amadio
Copy link
Member

amadio commented Feb 24, 2025

It was easy to fix, so fixed in dcaa75f, 86ef56b and 560e360.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants