Current process crashes when downloading file #89

Open
sescobb27 opened this issue May 12, 2020 · 6 comments

@sescobb27

Environment

  • Elixir & Erlang versions (elixir --version):
      Erlang/OTP 22 [erts-10.5] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [hipe]
      Elixir 1.10.2 (compiled with Erlang/OTP 22)
  • ExAws version:
      * ex_aws 2.1.3 (Hex package) (mix)
        locked at 2.1.3 (ex_aws) 0bdbe2ae
      * ex_aws_s3 2.0.2 (Hex package) (mix)
        locked at 2.0.2 (ex_aws_s3) 0569f5b2
  • HTTP client version (e.g. for hackney, run mix deps | grep hackney):
      * hackney 1.15.2 (Hex package) (rebar3)
        locked at 1.15.2 (hackney) e0100f8e

Current behavior

Hi, when trying to download multiple files at once I'm getting the error below. It seems to crash the current process instead of returning an error tuple. I think this is because the download operation uses async_stream, which links the tasks to the current process, but I'm not sure that's the reason. See https://github.com/ex-aws/ex_aws_s3/blob/master/lib/ex_aws/s3/download.ex#L71-L93 and, from the docs:

The tasks will be linked to the current process, similarly to async/1.

https://hexdocs.pm/elixir/Task.html#async_stream/5

Besides that, I'm not seeing any other stack trace or error log that would help me diagnose the problem. I log errors in the current process and I also tried rescuing, without success, which is why I think the linking may be the reason.

May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     Args: [#Function<0.39970933/1 in ExAws.Operation.ExAws.S3.Download.download_to/3>, [%{end_byte: 16252927999, start_byte: 16200499200}]]
May 12 15:35:12 titan-media-parser-01 media_parser[1307]: Function: &:erlang.apply/2
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (stdlib 3.12) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (elixir 1.10.2) lib/task/supervised.ex:35: Task.Supervised.reply/5
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (elixir 1.10.2) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (ex_aws_s3 2.0.2) lib/ex_aws/s3/download.ex:76: anonymous fn/4 in ExAws.Operation.ExAws.S3.Download.download_to/3
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (ex_aws_s3 2.0.2) lib/ex_aws/s3/download.ex:21: ExAws.S3.Download.get_chunk/3
May 12 15:35:12 titan-media-parser-01 media_parser[1307]:     (ex_aws 2.1.3) lib/ex_aws.ex:66: ExAws.request!/2
May 12 15:35:12 titan-media-parser-01 media_parser[1307]: {:error, :checkout_timeout}
May 12 15:35:12 titan-media-parser-01 media_parser[1307]: ** (ExAws.Error) ExAws Request Error!
May 12 15:35:12 titan-media-parser-01 media_parser[1307]: 15:35:12.780 [error] Task #PID<0.7083.0> started from #PID<0.8148.0> terminating
May 12 15:35:12 titan-media-parser-01 media_parser[1307]: 15:35:12.779 [warn]  ExAws: HTTP ERROR: :checkout_timeout for URL: "..." ATTEMPT: 10

Expected behavior

The current process should not crash; the call should return an error tuple instead.

@sescobb27
Author

Hi there, any update on this? Can I help fix it? (I think it would need async_stream_nolink.) Or do you think this is not a problem in the lib? Or should I go the easy way and just trap exits in my processes?
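For reference, here is a minimal sketch of the difference async_stream_nolink makes. It does not use ExAws at all: `fetch_chunk` is a hypothetical stand-in for a chunk download that fails on one chunk, and the supervisor is started ad hoc for the demo (a real application would more likely put a named Task.Supervisor in its supervision tree).

```elixir
# Start an unnamed Task.Supervisor just for this demo.
{:ok, sup} = Task.Supervisor.start_link()

# Hypothetical stand-in for a chunk download; chunk 2 simulates a failed request.
fetch_chunk = fn
  2 -> raise "simulated checkout timeout"
  n -> {:chunk, n}
end

results =
  sup
  |> Task.Supervisor.async_stream_nolink(1..3, fetch_chunk, max_concurrency: 2)
  |> Enum.map(fn
    {:ok, chunk} -> {:ok, chunk}
    # A task that raised is reported as {:exit, {exception, stacktrace}}.
    {:exit, reason} -> {:error, reason}
  end)

# The caller is still alive here and sees the failure as data.
IO.inspect(results)
```

Because the tasks are not linked to the caller, the failed chunk comes back as an `{:error, {exception, stacktrace}}` entry in `results` instead of taking the calling process down. The task crash itself is still logged.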

@sescobb27
Author

The same happens with S3.download_file and with S3.upload.

@sescobb27
Author

A proposed solution would be something like this. It keeps the current behavior (a failed chunk still raises), but the error is raised in the calling process, so it can be rescued.

NOTE: we would need a way to pass the name of the Task.Supervisor, maybe via config.

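    def perform(op, config) do
      with {:ok, op} <- Upload.initialize(op, config) do
        # Pair each chunk with its 1-based part number.
        stream = Stream.with_index(op.src, 1)

        # TaskSupervisor is a placeholder for the name of a running Task.Supervisor.
        TaskSupervisor
        |> Task.Supervisor.async_stream_nolink(
          stream,
          Upload,
          :upload_chunk!,
          [Map.delete(op, :src), config],
          max_concurrency: Keyword.get(op.opts, :max_concurrency, 4),
          timeout: Keyword.get(op.opts, :timeout, 30_000)
        )
        |> Enum.map(fn
          {:ok, val} -> val
          # A task that raised exits with {exception, stacktrace}; re-raise it in the caller.
          {:exit, {%{__exception__: true} = error, stacktrace}} -> reraise error, stacktrace
          # Any other exit reason (for example a task calling exit/1) is propagated as an exit.
          {:exit, reason} -> exit(reason)
        end)
        |> Upload.complete(op, config)
      end
    end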

@jimsynz

jimsynz commented Feb 9, 2021

We're seeing the same thing in our system too:


18:26:50.155 [error] #PID<0.22527.61> running NarrativeService.APIWeb.Endpoint (cowboy_protocol) terminated
--
Server: content1.getnarrativeapp.com:80 (http)
Request: GET /static/***REDACTED***
** (exit) an exception was raised:
** (HTTPoison.Error) :checkout_timeout
(httpoison) lib/httpoison.ex:156: HTTPoison.request!/5
(elixir) lib/stream.ex:1362: anonymous fn/5 in Stream.resource/3
(elixir) lib/enum.ex:2979: Enum.reduce/3
(api) lib/api_web/controllers/image_controller.ex:1: NarrativeService.APIWeb.ImageController.action/2
(api) lib/api_web/controllers/image_controller.ex:1: NarrativeService.APIWeb.ImageController.phoenix_controller_pipeline/2
(api) lib/api_web/endpoint.ex:1: NarrativeService.APIWeb.Endpoint.instrument/4
(phoenix) lib/phoenix/router.ex:278: Phoenix.Router.__call__/1
(api) lib/api_web/endpoint.ex:1: NarrativeService.APIWeb.Endpoint.plug_builder_call/2
18:26:50.782 [warn] ExAws: HTTP ERROR: :checkout_timeout for URL: "https://s3.amazonaws.com/***REDACTED***" ATTEMPT: 5

My first thought was maybe pool exhaustion. Any thoughts on this @edgurgel?

@sescobb27
Author

@jimsynz I think :checkout_timeout is indeed pool exhaustion. You may need to increase the pool size or not use pooling at all; either can work for you. Be aware, though, that even with a larger pool you may hit this error again.
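As a rough sketch of the "bigger pool" option, assuming ExAws is using its default hackney client: the pool name `:s3_pool` and the numbers are hypothetical, and the pool has to be started before requests use it. Note that setting `:hackney_opts` replaces ExAws's default hackney options, so the usual ones are repeated here. This only affects requests made through ExAws, not direct HTTPoison calls.

```elixir
# config/config.exs
import Config

config :ex_aws, :hackney_opts,
  # Hypothetical dedicated pool; use `pool: false` instead to skip pooling entirely.
  pool: :s3_pool,
  follow_redirect: true,
  recv_timeout: 30_000

# In the application's supervision tree (for example in MyApp.Application.start/2),
# add a child that starts the pool. The values are illustrative, not recommendations:
#
#   :hackney_pool.child_spec(:s3_pool, timeout: 15_000, max_connections: 100)
```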

@jimsynz

jimsynz commented Feb 9, 2021

Yeah. Looking at https://github.com/benoitc/hackney/issues/ it looks like there has been a bunch of problems with the default pool of late.
