. #332 (Open)

wants to merge 64 commits into master
Conversation


@vipvip1811 commented Jun 7, 2021

No description provided.

@vipvip1811 changed the title from "error not run on AMD RX570. After load, it make my PC not responding." to "." on Jun 7, 2021
@Maroc-OS commented Jun 9, 2021

It works on AMD GPUs, but the CL version is known to have problems and does not work well; it skips addresses. Here is a log from a single GPU, and BTW someone reached that number of MKey/s on the same GPU :)

[2021-06-09.13:29:31] [Info] CleanedBitCrack
[2021-06-09.13:29:31] [Info] Compression: both
[2021-06-09.13:29:31] [Info] Starting at: 00000000000000000000000000000000000000000000000017915F2D9CC2CD98
[2021-06-09.13:29:31] [Info] Ending at:   00000000000000000000000000000000000000000000000018F612FF88F64830
[2021-06-09.13:29:31] [Info] Counting by: 0000000000000000000000000000000000000000000000000000000000000001
[2021-06-09.13:29:31] [Info] Threads: 128
[2021-06-09.13:29:31] [Info] Blocks: 64
[2021-06-09.13:29:31] [Info] Points per Thread: 4000
[2021-06-09.13:29:31] [Info] Compiling OpenCL kernels...
[2021-06-09.13:29:31] [Info] Initializing AMD Radeon Pro 560X Compute Engine
[2021-06-09.13:29:31] [Info] Allocating Memory for Buffers (4000.0MB)
[2021-06-09.13:29:32] [Info] Generating 32,768,000 starting points (1250.0MB)
[2021-06-09.13:29:35] [Info] 10.0%
[2021-06-09.13:29:36] [Info] 20.0%
[2021-06-09.13:29:36] [Info] 30.0%
[2021-06-09.13:29:36] [Info] 40.0%
[2021-06-09.13:29:36] [Info] 50.0%
[2021-06-09.13:29:36] [Info] 60.0%
[2021-06-09.13:29:36] [Info] 70.0%
[2021-06-09.13:29:36] [Info] 80.0%
[2021-06-09.13:29:36] [Info] 90.0%
[2021-06-09.13:29:36] [Info] 100.0%
[2021-06-09.13:29:36] [Info] Done
[2021-06-09.13:29:36] [Info] Loading addresses from '/Users/research/btctest/in.txt'
[2021-06-09.13:30:46] [Info] xxxxxxxxx address(es) loaded (234MB)
                             0 address(es) ignored
[2021-06-09.13:30:47] [Info] Initializing BloomFilter (512.0MB)
[00:03:05] 6512/4096MB | xxxxxxxxx targets 8014.80 MKey/s (100,401,335,142,415,000 remaining) [ETA 10.08 months]

(The count in the "xxxxxxxxx address(es) loaded (234MB)" line was masked by me here.)

This fork is roughly 80% improved in coding style, speed, and cross-platform compatibility, and it already includes almost 50% of @Uzlopak's additions from the past week.

@Uzlopak commented Jun 10, 2021

Wow.

First of all... I think this is not mergeable as is, because I removed the CUDA files; I could not guarantee that the CUDA part would still work after my modifications. So if we fork again and overwrite the files with the changed ones, it will probably be an acceptable merge.

Secondly: How did you get it to 8014.80 MKey/s? I just get 360 MKey/s at best from my Vega56. And my Vega56 should be about 250% faster than your Radeon 560?

Or is my card maybe stronger than I know and my overall system too slow (too old a CPU and DDR3 RAM)?

@Maroc-OS

I already sent you a message via Gmail when you opened the pull request.
OK, my tool works with both CUDA and CL; it is a cleaned-up version from two years or so ago, plus my changes and some changes from other people, like yours.

I have kept -r and most of its features, and maybe CUDA needs some tweaks too.

I have some mods here with your CL files and more fixes and tweaks. I will push them now as small pushes, without rebasing. Take a look:

https://github.com/MarocOS/CleanedBitcrack

OK, there are still a lot of things that must be fixed and/or merged manually, like error reporting and some other stuff, to port all of your changes. But you don't have to remove CUDA, as it is the one that works perfectly; the CL version always skips a lot of keys.

@Maroc-OS

The optimal parameters I found are: blocks = 64, threads = blocks (or double the blocks), and points per thread sized so the buffers fill 3/4 of the card's RAM or even all of it; that gives the best performance: 8014.80 MKey/s.

Keep in mind: treat 4 GB as 4000 MB, not 4096 MB.

I was trying an automatic solution based on hardware capabilities, but it does not seem to work well.
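
For reference, a minimal sketch of what such an automatic selection could look like, assuming the parameters are derived from CL_DEVICE_GLOBAL_MEM_SIZE; the fixed 64-block heuristic just mirrors the numbers above, and BYTES_PER_POINT is a hypothetical placeholder, not the fork's actual per-point buffer cost:

/* Sketch only: derive -b/-t/-p from the device's reported global memory.
 * BYTES_PER_POINT is a placeholder, not the fork's actual per-point cost. */
#include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */
#include <stdio.h>

#define BYTES_PER_POINT 96

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong globalMem = 0;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) return 1;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, NULL);

    unsigned int blocks = 64;            /* the value that worked best above */
    unsigned int threads = blocks * 2;   /* "blocks or double the blocks" */

    /* target roughly 3/4 of the card's memory, leaving some headroom
     * (treat 4 GB as 4000 MB, as noted above) */
    cl_ulong budget = globalMem / 4 * 3;
    unsigned int pointsPerThread =
        (unsigned int)(budget / ((cl_ulong)blocks * threads * BYTES_PER_POINT));

    printf("-b %u -t %u -p %u\n", blocks, threads, pointsPerThread);
    return 0;
}

Whatever the heuristic, the allocation can still fail later, so falling back to a smaller -p when buffer creation errors out would be the safer route.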

@Maroc-OS

Secondly: How did you get it to 8014.80 MKey/s? I just get 360 MKey/s at best from my Vega56. And my Vega56 should be about 250% faster than your Radeon 560?

My system is an iMac (i5, 4K, 2019) with an AMD Radeon Pro 560X Compute Engine.

@Uzlopak commented Jun 11, 2021

Hi @Maroc-OS

I did not get any e-mail. I checked my e-mails, but nothing :/. Maybe you can send me the e-mail again? [email protected].

Btw, I deleted _stepKernelWithDouble in my branch. In hindsight that was maybe wrong, so don't remove it from your branch.

I personally think I have maxed out the possibilities from my side. Maybe some dynamic parallelism in the ripemd160 hash, because you can do the rounds in parallel; that's why I prepared it as two separate functions. I had hired a dev on Fiverr because I had no time to figure out how to do the dynamic parallelism implementation, but he never did it.

Other than that, we will not get further performance gains without more math. E.g. invModP could be improved by using the extended Euclidean algorithm to get the inverse in a non-time-linear (= faster) manner; the little-Fermat solution always takes the same (linear) number of steps and is used by the secp256k1 implementations to ensure there is no side-channel attack. Also, invModP uses something like 256 multiplications, so it is the biggest bottleneck in the whole algorithm, as it is called n times. But how to implement it with the extended Euclidean algorithm? I don't know.
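
For illustration only, a toy C sketch of the extended Euclidean inverse on a 64-bit modulus; the real kernel would need the same thing on 256-bit multi-word values, which is exactly the part that is not worked out yet:

/* Toy extended-Euclidean modular inverse (64-bit only, a in (0, p), p prime).
 * The number of loop iterations depends on the input, unlike the Fermat
 * exponentiation, which is why constant-time libraries avoid it. */
#include <stdint.h>

int64_t invmod(int64_t a, int64_t p)
{
    int64_t old_r = a % p, r = p;
    int64_t old_s = 1, s = 0;

    while (r != 0) {
        int64_t q = old_r / r;
        int64_t t;

        t = old_r - q * r; old_r = r; r = t;   /* gcd step */
        t = old_s - q * s; old_s = s; s = t;   /* Bezout coefficient for a */
    }
    /* here old_r == gcd(a, p) == 1, so old_s * a == 1 (mod p) */
    return ((old_s % p) + p) % p;              /* normalize into [0, p) */
}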

Maybe multiplication is also slow.

Other than that, I suppose there is still some speed to gain by turning the global variables into private variables. If I understand __constant correctly, it is an alias for global memory. So potentially we could use the constants directly per #define, creating e.g. a sub256kP method where you use P_7, P_6, ... directly instead of reading them from memory. This could reduce the time spent on global-memory lookups and speed up the whole calculation significantly.
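
As a minimal illustration of that idea (the subP name, the P_0..P_7 defines, and the word order, with a[0] the most-significant word as in greaterOrEqualToP below, are assumptions, not the fork's actual code):

/* Sketch: subtract the secp256k1 prime using literal words instead of a
 * __constant array lookup. Caller ensures a >= P. */
#define P_0 0xffffffffU
#define P_1 0xffffffffU
#define P_2 0xffffffffU
#define P_3 0xffffffffU
#define P_4 0xffffffffU
#define P_5 0xffffffffU
#define P_6 0xfffffffeU
#define P_7 0xfffffc2fU

void subP(unsigned int a[8])
{
    /* private copy built from the defines; once the loop is unrolled the
     * compiler can fold these into immediates instead of memory loads */
    const unsigned int p[8] = { P_0, P_1, P_2, P_3, P_4, P_5, P_6, P_7 };
    unsigned long long borrow = 0;

    for (int i = 7; i >= 0; i--) {          /* a[7] is the least-significant word */
        unsigned long long d = (unsigned long long)a[i] - p[i] - borrow;
        a[i] = (unsigned int)d;
        borrow = (d >> 32) & 1;             /* 1 if the word subtraction underflowed */
    }
}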

I also suspect that I "improved" greaterOrEqualToP incorrectly. It should of course be:

#define greaterOrEqualToP(a)    \
    (                           \
        (a[0] == 0xffffffff) && \
        (a[1] == 0xffffffff) && \
        (a[2] == 0xffffffff) && \
        (a[3] == 0xffffffff) && \
        (a[4] == 0xffffffff) && \
        (a[5] == 0xffffffff) && \
        (a[6] >= 0xfffffffe) && \
        (a[7] >= 0xfffffc2f)    \
    )

I also implemented a generatePublicKey function. You can see it in my "different" branch. To test it, run with -b 1 -p 1 or else it will not work. It is super slow, as it uses the super slow inverseMod and mulModP implementations. I renamed some functions, so it probably will not work if you transplant it directly into bitcoin.cl or secp256k1.cl. Maybe it is useful anyway, as it could make BitCrack more modular and a starting point for future modifications and new products.

My actual assumption is that we currently use multiple GB of global memory on the GPU, so the biggest bottleneck is the constant reading and writing of global memory. To gain speed it would be necessary to reduce those memory lookups. Calculating the first public key is very expensive, but after that we could do the point addition, which we already do anyway when calling those batch functions, on the fly. To be honest, I did not figure out how the original BitCrack keeps the inverse in those batch methods. If we could keep that smart inverse approach, we could avoid doing the costly inverse: we would just calculate the public key at the beginning of stepKernel and then calculate all further public keys on the fly.
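
For what it's worth, the usual trick behind such batch methods is Montgomery's simultaneous inversion: multiply all the values together, invert that single product, then peel the individual inverses back out. A toy C sketch (small prime so products fit in 64 bits; invmod as in the sketch further up), without claiming this is exactly what the original code does:

/* Toy batch (simultaneous) inversion: one invmod call for n values.
 * Assumes p < 2^31 so products fit in int64, and n <= 64 for this sketch. */
#include <stdint.h>

int64_t invmod(int64_t a, int64_t p);   /* e.g. the extended-Euclidean sketch above */

void batch_invert(const int64_t *x, int64_t *inv, int n, int64_t p)
{
    int64_t prefix[64];                 /* prefix[i] = x[0] * ... * x[i] mod p */
    int64_t acc = 1;

    for (int i = 0; i < n; i++) {
        acc = (acc * x[i]) % p;
        prefix[i] = acc;
    }

    int64_t invAcc = invmod(acc, p);    /* the single expensive inversion */

    for (int i = n - 1; i > 0; i--) {
        inv[i] = (invAcc * prefix[i - 1]) % p;   /* = 1 / x[i] mod p */
        invAcc = (invAcc * x[i]) % p;            /* = 1 / (x[0] * ... * x[i-1]) */
    }
    inv[0] = invAcc;
}

With this trick the cost per element drops to a few multiplications plus a 1/n share of a single inversion, which is the usual argument for cranking the batch size up.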

So instead of doing just 4000 points per thread, we would crank it up to e.g. 65536 keys per thread.

I would be glad if someone found a way to improve the generatePublicKey function, e.g. by using Jacobian points, so you only have to do one invModP at the end and not all the time.
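
(For context: in Jacobian coordinates a point is stored as (X, Y, Z) with x = X/Z^2 and y = Y/Z^3, so doublings and additions need only multiplications and additions; a single invModP of the final Z is enough to convert back to affine coordinates at the end.)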

@Maroc-OS commented Jun 11, 2021

Hello again,

Hi @Maroc-OS

I did not get any e-mail. I checked my e-mails, but nothing :/. Maybe you can send me the e-mail again? [email protected].

Yeah, I used the one on your GitHub account, [email protected] :)

Btw, I deleted _stepKernelWithDouble in my branch. In hindsight that was maybe wrong, so don't remove it from your branch.

Nope, it is still there; I did not remove it.

I personally think I have maxed out the possibilities from my side. Maybe some dynamic parallelism in the ripemd160 hash, because you can do the rounds in parallel; that's why I prepared it as two separate functions. I had hired a dev on Fiverr because I had no time to figure out how to do the dynamic parallelism implementation, but he never did it.

and

Maybe multiplication is also slow.

Other than that, I suppose there is still some speed to gain by turning the global variables into private variables. If I understand __constant correctly, it is an alias for global memory. So potentially we could use the constants directly per #define, creating e.g. a sub256kP method where you use P_7, P_6, ... directly instead of reading them from memory. This could reduce the time spent on global-memory lookups and speed up the whole calculation significantly.

I also suspect that I "improved" greaterOrEqualToP incorrectly. It should of course be:

#define greaterOrEqualToP(a)    \
    (                           \
        (a[0] == 0xffffffff) && \
        (a[1] == 0xffffffff) && \
        (a[2] == 0xffffffff) && \
        (a[3] == 0xffffffff) && \
        (a[4] == 0xffffffff) && \
        (a[5] == 0xffffffff) && \
        (a[6] >= 0xfffffffe) && \
        (a[7] >= 0xfffffc2f)    \
    )

I also implemented a generatePublicKey function. You can see it in my "different" branch. To test it, run with -b 1 -p 1 or else it will not work. It is super slow, as it uses the super slow inverseMod and mulModP implementations. I renamed some functions, so it probably will not work if you transplant it directly into bitcoin.cl or secp256k1.cl. Maybe it is useful anyway, as it could make BitCrack more modular and a starting point for future modifications and new products.

The whole set of CL files needs a review; you started that task and we can make it better.

Other than that, we will not get further performance gains without more math. E.g. invModP could be improved by using the extended Euclidean algorithm to get the inverse in a non-time-linear (= faster) manner; the little-Fermat solution always takes the same (linear) number of steps and is used by the secp256k1 implementations to ensure there is no side-channel attack. Also, invModP uses something like 256 multiplications, so it is the biggest bottleneck in the whole algorithm, as it is called n times. But how to implement it with the extended Euclidean algorithm? I don't know.

We should take a look at NVIDIA, AMD, and Intel hardware capabilities and implementations. We could use faster CL code by targeting each vendor separately and trying to use the hard-wired capabilities of those compute engines, if we can use that term.
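
A minimal sketch of how that per-vendor targeting could start, assuming we just look at CL_DEVICE_VENDOR and pass different -D build options to clBuildProgram (the macro names are placeholders, not anything the kernels currently define):

/* Sketch: pick vendor-specific build defines for the OpenCL kernels. */
#include <CL/cl.h>   /* on macOS: #include <OpenCL/opencl.h> */
#include <string.h>

const char *vendorBuildOptions(cl_device_id device)
{
    char vendor[256] = {0};
    clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor) - 1, vendor, NULL);

    if (strstr(vendor, "NVIDIA")) return "-DVENDOR_NVIDIA";
    if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
        return "-DVENDOR_AMD";
    if (strstr(vendor, "Intel")) return "-DVENDOR_INTEL";
    return "";
}

/* later: clBuildProgram(program, 1, &device, vendorBuildOptions(device), NULL, NULL); */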

My actual assumption is that we currently use multiple GB of global memory on the GPU, so the biggest bottleneck is the constant reading and writing of global memory. To gain speed it would be necessary to reduce those memory lookups. Calculating the first public key is very expensive, but after that we could do the point addition, which we already do anyway when calling those batch functions, on the fly. To be honest, I did not figure out how the original BitCrack keeps the inverse in those batch methods. If we could keep that smart inverse approach, we could avoid doing the costly inverse: we would just calculate the public key at the beginning of stepKernel and then calculate all further public keys on the fly.

So instead of doing just 4000 points per thread, we would crank it up to e.g. 65536 keys per thread.

It depends on the GPU memory. As you can see in my last example, when I used 4000 points per thread the calculations got us to 6512/4096 MB; we are using more memory there than the GPU actually has. I don't know how this was possible, but when you add more you get CL_INVALID_VALUE or maybe an allocation-failure error.
Also, I can already say that reads and writes are already restricted using CL_MEM_READ_ONLY/CL_MEM_WRITE_ONLY etc., but your idea can be done if we optimize the code more. I was trying to get AMD CodeXL to help optimize the code, but there is no macOS version.

I would be glad if someone found a way to improve the generatePublicKey function, e.g. by using Jacobian points, so you only have to do one invModP at the end and not all the time.

We could try this if you want.

PS: some resources that might help:
OCLoptimizer: an iterative optimization tool for OpenCL
AMD CodeXL

@vipvip1811 (Author) commented Jun 12, 2021 via email

@jamiegreen7

20000000000000000:3ffffffffffffffff -c 13zb1hQbWVsc2S7ZTZnP2G4undNNpdh5so
