
Incorrect granule sizes #785

Open
betolink opened this issue Aug 13, 2024 · 5 comments

@betolink
Member

betolink commented Aug 13, 2024

earthaccess is reporting incorrect granule sizes because we are not correctly parsing the UMM path that contains the size. Granules usually map to a single file, and if the granule metadata reports the size in MB, everything works as expected. However, if a granule contains multiple files and/or the units are not MB, the reported size will be incorrect.

This issue was reported by David Giles.

Example:

{
   "DataGranule":{
      "ArchiveAndDistributionInformation":[
         {
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf"
         },
         {
            "Checksum":{
               "Algorithm":"MD5",
               "Value":"45bf3cf50f837a0db6350c3c6bcd3356"
            },
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf.met",
            "Size":"8.2265625",
            "SizeUnit":"KB"
         },
         {
            "Checksum":{
               "Algorithm":"MD5",
               "Value":"9d2bbf8e8fa88c2b105da6b7a9940093"
            },
            "Name":"CAL_LID_L2_VFM-Standard-V4-51.2022-01-15T23-28-52ZD.hdf",
            "Size":"47.05515384674072",
            "SizeUnit":"MB"
         }
      ]
   }
}

This granule is a great example: it contains multiple files with sizes in different units. The correct total size should be ~47 MB + 8 KB.

The method that needs to be updated is

def size(self) -> float:
and we should also handle cases where the information is not there.
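A minimal sketch of one possible fix (assuming the UMM-G layout shown above; treating KB/MB/GB as binary multiples is an assumption, see the discussion below):

# A minimal sketch, not the final implementation. It assumes the
# ArchiveAndDistributionInformation layout shown above; whether MB means
# 1000**2 or 1024**2 bytes is ambiguous (discussed further down).
_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

def granule_size_mb(umm: dict) -> float:
    """Sum the sizes of all files in a granule, returned in MB."""
    files = umm.get("DataGranule", {}).get(
        "ArchiveAndDistributionInformation", []
    )
    total_bytes = 0.0
    for f in files:
        try:
            total_bytes += float(f["Size"]) * _FACTORS[f["SizeUnit"]]
        except (KeyError, ValueError):
            continue  # size or unit missing/unparsable: skip for now
    return total_bytes / _FACTORS["MB"]

For the example granule this returns ~47.06 MB, instead of, e.g., the ~55.28 a unit-blind sum of the raw Size values (8.2265625 + 47.05515384674072) would give.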

@chuckwondo
Collaborator

Unfortunately, this cannot be computed unambiguously because the supplied size values are not necessarily consistently computed. For example, some providers might compute an MB value as bytes / 1000 / 1000, whereas others might compute it as bytes / 1024 / 1024, so we have no way of knowing whether to multiply by 1000^2 or 1024^2 to get the number of bytes.
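To make the spread concrete with the 47.05515384674072 MB file from the example above:

47.05515384674072 * 1000**2  # ≈ 47,055,154 bytes (decimal interpretation)
47.05515384674072 * 1024**2  # = 49,340,905 bytes (binary interpretation)

That is a difference of roughly 5%. (Incidentally, the binary interpretation lands essentially on a whole number of bytes, which hints that this particular provider divided by 1024^2, but nothing in the metadata guarantees that in general.)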

Unfortunately, this was a poor design decision in the UMM, and it should have simply been designed such that the size reported in the metadata is always bytes (which also avoids rounding errors, even if we know for sure what to multiply by). This is why the UMM was later modified to include a SizeInBytes metadata value.

UMM-G v1.6 added SizeInBytes. See the description in the schema, which describes exactly this problem.

With that said, I don't currently have a suggestion for a sensible solution to this.

Further, even if the above were not the case (i.e., even without any ambiguity), I'm not sure that computing the "size" as the sum of individual sizes makes sense. I suppose it might make sense if you want to know the total volume that would be downloaded if all files in the granule were downloaded, but I'm not sure that's a common use case.

Even if that is a common use case, I would think we should also include a mechanism for users to obtain individual file sizes as well (again, ignoring the ambiguity mentioned above).

One path to explore for the size ambiguity might be to provide some sort of size_hint method. When a granule does not specify a SizeInBytes value (which is unambiguous), size_hint could assume powers-of-2 multipliers (e.g., KB=1024, MB=1024^2, etc.). This would possibly overestimate sizes, which is perhaps better than underestimating with powers of 1000. The idea is similar to Python's own __length_hint__ vs __len__ (see https://docs.python.org/3/reference/datamodel.html#object.__length_hint__).
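Something along these lines, for instance (names and structure are hypothetical, not an existing earthaccess API):

# Hypothetical sketch of size_hint: prefer the unambiguous SizeInBytes
# (UMM-G >= 1.6) when present; otherwise assume binary multipliers, so
# we would rather overestimate than underestimate.
_BINARY = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def size_hint(umm: dict) -> int:
    files = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation", [])
    total = 0
    for f in files:
        if "SizeInBytes" in f:
            total += int(f["SizeInBytes"])
        elif "Size" in f:
            total += round(float(f["Size"]) * _BINARY.get(f.get("SizeUnit", "B"), 1))
    return total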

In addition, we might want to provide some sort of size_in_units method that does no computation (i.e., makes no assumptions), simply returning perhaps a tuple of (float, str), where the first value is the float of the size attribute, and the second value is the sizeunit string, so the user can then choose how they wish to deal with it. For example (taking from your sample metadata above): (47.05515384674072, "MB")
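For instance (hypothetical again, returning one tuple per file since a granule may list several):

# Hypothetical sketch of size_in_units: no conversion, no assumptions,
# just the raw (value, unit) pairs for the caller to interpret.
def size_in_units(umm: dict) -> list[tuple[float, str]]:
    files = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation", [])
    return [
        (float(f["Size"]), f["SizeUnit"])
        for f in files
        if "Size" in f and "SizeUnit" in f
    ]

# For the granule above: [(8.2265625, "KB"), (47.05515384674072, "MB")]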

@betolink
Member Author

betolink commented Aug 14, 2024

What about having size_hint() as a fallback? In the example above, we could take the reported units and values into account to sum the files in the granule. If there is only a value like 10203445 and no reported unit, then yes, we can just pass it through as is.

@chuckwondo
Collaborator

chuckwondo commented Aug 18, 2024

What about having size_hint() as a fallback? In the example above, we could take the reported units and values into account to sum the files in the granule. If there is only a value like 10203445 and no reported unit, then yes, we can just pass it through as is.

I think what we must first do is clearly define the use cases and requirements around the use of any type of size "computation" we want to support. Without gaining some clarity around what we want/need, there's little sense in discussing how to implement anything.

What specifically do we want to support/provide through a size method/function and any potentially related methods/functions, such as perhaps size_in_units?

@asteiker
Member

How does Earthdata Search currently handle granule size estimation? I know that they provide an estimated size upon ordering (see screenshot). Maybe we could leverage their work? https://github.com/nasa/earthdata-search
[Screenshot: Earthdata Search's estimated project size shown during ordering]

@betolink
Member Author

Great suggestion @asteiker! This is what they say:

This is the estimated overall size of your project. If no size
information exists in a granule's metadata, it will not be
included in this number. The size is estimated based upon the
first 20 granules added to your project from each collection.

And they seem to convert units into a common unit: https://github.com/nasa/earthdata-search/blob/619d533e53906550ed6428162c25b4878d858768/static/src/js/util/project.js#L8 (there is more code).

So I think we should follow similar logic, for consistency with what users see in the NASA portal. Maybe we can be even more accurate when we have better data available. This also relates to a conversation we had, @chuckwondo, about lazy loading of results; I don't remember exactly whether we covered using a "resultset" class where we could paginate the results from CMR, etc. For now, I think we should implement the following:

If a granule has complete metadata on size and units, we should sum the file sizes and report the total to the user via granule.size(). If a granule has incomplete metadata, we should perhaps only pass the data through as is (tuples, like you mentioned), or we could implement a size_hint(); see the sketch below. What do you all think? cc @jhkennedy
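Something like this, perhaps (a hypothetical sketch, with the same binary-multiplier assumption discussed above):

# Hypothetical sketch: sum only when every file reports both Size and a
# recognized SizeUnit; otherwise return None so the caller can fall back
# to size_hint() or the raw (value, unit) tuples.
_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

def size(umm: dict) -> float | None:
    files = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation", [])
    if files and all("Size" in f and f.get("SizeUnit") in _FACTORS for f in files):
        total_bytes = sum(float(f["Size"]) * _FACTORS[f["SizeUnit"]] for f in files)
        return total_bytes / _FACTORS["MB"]
    return None  # incomplete metadata: use size_hint() / size_in_units()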
