Tuesday, May 15, 2018

PiggyFuse HVSP AVR fuse programmer

Although I've been working with AVR MCUs for a number of years now, I had never made a high voltage programmer.  I've seen some HVSP fuse resetter projects I liked, but I don't have a tiny2313.  I think I was also hesitant to hook up 12V to an AVR, since I had fried my first ATMega328 Pro Mini by accidentally connecting a 12V source to VCC.  However, if you want to be an expert AVR hacker, you'll have to tackle high-voltage programming.  Harking back to my Piggy-Prog project, I realized I could do something similar for a fuse resetter, which would simplify the wiring and reduce the parts count.

I considered using a charge pump to provide 12V, like some other HVSP projects do, but adding at least 3 diodes and capacitors would more than double the parts count.  I also realized that most AVR hackers probably have a 12V power source available.  Old ATX power supplies have 12V on the 3.5" floppy connector, which 0.1" pin headers easily plug into.  Old DSL modems and home routers often run from a 12V DC supply.  I decided to use a 14.4V tool battery with a small switching converter.  I even thought of using one of my TL431s, but hooking up a few alligator clips to the switching converter was quicker.

Instead of just copying another program verbatim, I decided to implement the core of the programming algorithm myself.  The AVR datasheets list two algorithms, though all of the HVSP programs I could find followed the first algorithm and not the alternative one.  Both algorithms are somewhat obtuse, and even seem to contradict the datasheet specifications, which state that the minimum latching time for Prog_enable is only 100ns.

After debugging with my oscilloscope on a tiny13 and a tiny85, I realized that the two parts have different ways of entering HVSP mode.  The tiny13 will enter programming mode when it powers up and finds 12V on the reset pin, while the tiny85 requires the reset pin to be at 0V on power-up before applying 12V.  Although the datasheet doesn't explicitly state it, the target drives SDO high when it has entered HVSP mode and is ready for commands.  In the case of the tiny85, that happens about 300us after applying 12V to the reset pin.  However with the tiny13, it happens much sooner, around 10us.  This means the datasheet's recommendation to hold SDO low for at least 10us after 12V has been applied is not only wrong, it's potentially damaging.  During my experimenting, I observed my tiny13 attempting to drive SDO high while the programmer was still holding it low.  That caused VCC to drop from 5V to 4V, likely approaching the 40mA maximum I/O current.  And since the datasheet specifies a minimum VCC of 4.5V for HVSP, the droop to 4V could cause programming errors.  In the scope image, the yellow line is VCC, and the blue line is SDO.

Instead of waiting 300us after applying 12V to send any commands, I considered just waiting for SDO to go high.  While this would work fine for the ATtiny13 and ATtiny85, it's possible some other parts drive SDO high before they are ready to accept commands.  Therefore I decided to stick with the 300us wait.  To avoid the contention on SDO shown in the scope image above, I switch SDO to input immediately after applying 12V.  Since it is held at ground up to that point, which removes any charge on the pin, it won't float high once it is switched to input.

Another source of potential problems with other HVSP projects is the size of the resistor on the 12V reset pullup.  I measured as much as 1mA of current through the reset pin on a tiny13 when 12V was applied, so using a 1K resistor risks having the voltage drop below the 11.5V minimum required for programming.  I recommend using around 470 ohms, and a 12.5V supply if possible.

Putting it all together

As shown in the very first photo, I used a mini-breadboard with the 8-pin ATtiny positioned so that the Pro Mini will plug in with its raw input at the far right.  In the photo I have extra wires connected for debugging.  The only required ones are as follows:
12V supply to RAW on the Pro Mini
GND supply to GND on the Pro Mini
Pullup resistor from 12V supply to ATtiny reset (pin 1)
NPN collector to ATtiny reset
NPN base to Pro Mini pin 11
NPN emitter to Pro Mini pin 12 & ATtiny GND
ATtiny VCC (pin 8) to Pro Mini pin 6

The Pro Mini must be a 5V version.  The optional resistor from the Pro Mini pin 6 lights the green LED when the Pro Mini has successfully recognized a target ATtiny.  Although the program outputs logs via the serial UART at 57.6kbps, it uses the on-board pin 13 LED to allow for stand-alone operation.  Failure is a long flash for a second followed by 2 seconds off.  Success is one or two short flashes.  One flash is for my preferred debug fuse settings with 1.8V BOD and DWEN along with RSTDISBL.  Two flashes is for my "product" fuse settings with RSTDISBL.  The program will alternate between debug and product fuses each time the Pro Mini is reset.

The code is in my github repo:

And finally, the money shot:

Saturday, April 21, 2018

Debugging debugWire

Many modern AVRs have an on-chip one-wire debugger called debugWire that uses the RESET pin when the DWEN fuse is programmed.  The AVR manuals provide no details on the protocol, and the physical layer description is rather terse: "a wire-AND (open-drain) bi-directional I/O pin with pull-up enabled".  While much of the protocol has been reverse-engineered, my initial experiments with debugWIRE on an ATtiny13 were unreliable.  Suspecting possible issues at the physical layer, I got out my scope to do some measurements.

I started with a single diode as recommended by David Brown.  I used a terminal program to send a break to the AVR, which responds with 0x55 about 150us after the break.  As can be seen in the scope image above, the rise time can be rather slow, especially with Schottky diodes since they have much higher capacitance than standard diodes like a 1N4148.  Instead of a diode, Matti Virkkunen recommends a 1K resistor for debugWIRE.  For UPDI, which looks like an updated version of debugWire, a 4.7K resistor is recommended.  I ended up doing a number of tests with different resistor values, as well as tests with a few transistor-based circuits.  While the best results were with a 2N3904 and a 47K pull-up to Vcc on the base, I achieved quite satisfactory results with a 1.4K resistor:

The Tx low signal from both the PL2303 and the AVR is slightly below 500mV, which both sides consistently detect as a logic level 0.  With Vcc at 3.5V, I found that levels above 700mV were not consistently detected as 0.  As can be seen from the scope image, the signal rise time is excellent.  The Tx low from the PL2303 is slightly lower than the Tx low from the AVR, so a 1.5K resistor would likely be optimal instead of 1.4K.

You might notice in the scope image that there are two zero frames before the 0x55 response from the AVR.  The first is a short break sent to the AVR, and the second is the break that the AVR sends before the 0x55.  While some define a break as a low signal that lasts at least two frame times, the break sent by the AVR is 10 bit-times.  Since debugWire uses 81N, a transmitted zero will be low for 8 bits plus the start bit before going high for the stop bit.  That means the longest valid frame will be low for 9 bit-times, and anything more than 10 bit-times low can be assumed to be a break.  Another thing I discovered was that the AVR does not require a break to enter dW mode after power-up.  A zero frame (low for 9 bit-times) is more than enough to activate dW mode, stopping the execution on the target.  Once in dW mode, a subsequent zero frame will cause the target to resume running, while still listening for additional commands.
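The frame-length reasoning can be captured in a tiny classifier.  This is a hypothetical helper of my own, not part of any dW tool:

```go
package main

import "fmt"

// With 81N framing, a zero byte holds the line low for the start bit
// plus 8 data bits = 9 bit-times.  A low period longer than 10
// bit-times therefore cannot be a data frame and is treated as a
// break; the bit-time between 9 and 10 is left as a guard band.
func isBreak(lowBitTimes float64) bool {
	return lowBitTimes > 10
}

func main() {
	fmt.Println(isBreak(9))  // longest valid data frame: false
	fmt.Println(isBreak(12)) // well past the guard band: true
}
```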

My results with a 1.4K resistor are specific to the ATtiny13 + PL2303 combination I am using.  A different USB-TTL dongle such as a CP2102 or CH340G could have a different Tx output impedance, so a different resistor value may be better.  A more universal method would be to use the following basic transistor circuit:

One caveat to be aware of when experimenting with debugWIRE is that depending on your OS and drivers, you may not be able to use custom baud rates.  For example under Windows 7, I could not get my PL2303 adapters to use a custom baud rate.  Under Linux, after confirming from the driver source that custom baud rates are supported, I was eventually able to set the port to the ~71kbps baud rate I needed to communicate with my tiny13.  That adventure is probably worthy of another blog post.  For a sample of what I'll be discussing, you can look at my initial attempt at a utility to detect the dW baud rate of a target AVR.

Tuesday, April 3, 2018

TTL USB dongles: Hacker's duct tape

For micro-controller projects, a TTL serial UART has a multitude of uses.  At a cost that is often under $1, it's not hard to justify having a few of them on hand.  I happen to have several.

The first and probably simplest use is a breadboard power supply.  Most USB ports will provide at least 0.5A of 5V power, and the 3.3V regulator built into the UART chip can supply around 250mA.  With a couple of my dongles, I used a pair of pliers to straighten the header pins in order to plug them easily into a breadboard.

I've previously written about how to make an AVR programmer, although now that USBasp clones are widely available for under $2, there is little reason to go through the trouble.  Speaking of the USBasp, they can also be used along with a TTL USB dongle to do 2.4Msps 2-channel digital capture.

Since TTL dongles usually have Rx and Tx LEDs, they can be used as simple indicator lights.  To show that a shell script has finished running, just add:
$ cat < /dev/zero > /dev/ttyUSB0
to the end of the script.  The continuous stream of zeros will mean the Tx LED stays brightly illuminated.  To cause the Tx LED to light up or flash for a specific period of time, set the baud rate, then send a specific number of characters:
$ stty -F /dev/ttyUSB0 300
$ echo -n '@@@@@@@@' > /dev/ttyUSB0
Adding a start and a stop bit to 8-bit characters means a total of 10 bits transmitted per character, so sending 8 characters to the port will take 80 bit-times, which at 300 baud is 267ms.
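The arithmetic works out as follows (a quick sketch of my own, not part of any published utility):

```go
package main

import "fmt"

// txMillis returns the time to send n characters at the given baud
// rate with 8N1 framing: start bit + 8 data bits + stop bit = 10
// bits per character.
func txMillis(chars, baud int) float64 {
	return float64(chars*10) / float64(baud) * 1000
}

func main() {
	// 8 characters at 300 baud: 80 bit-times = 266.7ms, i.e. the
	// 267ms quoted above.
	fmt.Printf("%.0f ms\n", txMillis(8, 300))
}
```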

It's also possible to generate a clock signal using a UART.  The ASCII code for the letter 'U' is 0x55, which is 01010101 in 8-bit binary.  A UART transmits the least significant bit first, so after adding the start bit (0) and stop bit (1), the output is a continuous stream of alternating ones and zeros.  Simply setting the port to a high baud rate and echoing a stream of Us will generate a clock signal of half the baud rate.  Depending on OS and driver overhead, it may not be possible to pump out a continuous stream of data sending one character at a time from the shell.  Therefore I created a small program in C that will send data in 1KB blocks.  Using this program with a 3Mbps baud rate I was able to create a 1.5MHz clock signal.
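A minimal sketch of the same idea (the program I actually use is in C; this illustration of my own writes to stdout instead of a real serial port):

```go
package main

import (
	"fmt"
	"os"
)

// uBlock builds a block of 'U' (0x55) bytes.  Sent LSB-first with a
// 0 start bit and 1 stop bit, 'U' makes the Tx line alternate every
// bit-time, so the stream is a square wave at half the baud rate.
func uBlock(n int) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = 'U'
	}
	return b
}

// clockHz gives the resulting clock frequency for a baud rate.
func clockHz(baud int) int { return baud / 2 }

func main() {
	block := uBlock(1024)
	// The real program loops forever, writing to the opened serial
	// port (e.g. /dev/ttyUSB0) instead of stdout.
	os.Stdout.Write(block)
	fmt.Fprintln(os.Stderr, "3000000 baud ->", clockHz(3000000), "Hz")
}
```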

If you have servers that you monitor remotely, you can use TTL dongles to reset a hung server.  This works best with paired servers, where one server can reset the other.  The RESET pin on a standard PC motherboard works by being pulled to ground, so the wiring is very basic:
RESET <---> TxD
  GND <---> GND
When one server hangs, log into the other, and send a break (extended logic level 0) to the serial port.  That will pull RESET on the hung server low, causing it to reboot.

Saturday, March 3, 2018

Fast small prime checker in golang

Anyone who does any crypto coding knows that the ability to generate and test prime numbers is important.  A search through the golang crypto packages didn't turn up any function to check if a number is prime.  The "math/big" package has a ProbablyPrime function, but the documentation is unclear on what value of n is needed for it to be "100% accurate for inputs less than 2⁶⁴".  For the Ethereum miner I am writing, I need a function to check numbers smaller than 26 bits, so I decided to write my own.

Since int32 is large enough for the biggest number I'll be checking, and 32-bit integer division is usually faster than 64-bit, even on 64-bit platforms, I wrote my prime checking function to take a uint32.  A basic prime checking function will usually test odd divisors up to the square root of N, skipping all even numbers (multiples of two).  My prime checker is slightly more optimized by skipping all multiples of 3.  Here's the code:
func i32Prime(n uint32) bool {
    //    if (n==2)||(n==3) {return true;}
    if n%2 == 0 { return false }
    if n%3 == 0 { return false }
    sqrt := uint32(math.Sqrt(float64(n)))
    for i := uint32(5); i <= sqrt; i += 6 {
        if n%i == 0 { return false }
        if n%(i+2) == 0 { return false }
    }
    return true
}
My code will never call i32Prime with small numbers, so I have the first line that checks for two or three commented out.  In order to test and benchmark the function, I wrote prime_test.go.  Run the tests with "go test prime_test.go -bench=.".  For numbers up to 22 bits, i32Prime is one to two orders of magnitude faster than ProbablyPrime(0).  In absolute terms, on a Celeron G1840 using a single core, BenchmarkPrime reports 998 ns/op.  I considered further optimizing the code to skip multiples of 5, but I don't think the ~20% speed improvement is worth the extra code complexity.
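As an additional sanity check (my own sketch, separate from prime_test.go), the function can be compared against math/big's ProbablyPrime, which is documented as 100% accurate for inputs below 2⁶⁴.  The function is repeated here so the example is self-contained:

```go
package main

import (
	"fmt"
	"math"
	"math/big"
)

// i32Prime tests odd divisors up to sqrt(n), skipping multiples of
// 2 and 3; callers must not pass n < 5.
func i32Prime(n uint32) bool {
	if n%2 == 0 { return false }
	if n%3 == 0 { return false }
	sqrt := uint32(math.Sqrt(float64(n)))
	for i := uint32(5); i <= sqrt; i += 6 {
		if n%i == 0 { return false }
		if n%(i+2) == 0 { return false }
	}
	return true
}

func main() {
	// Cross-check every n in [5, 100000) against math/big.
	for n := uint32(5); n < 100000; n++ {
		if i32Prime(n) != big.NewInt(int64(n)).ProbablyPrime(0) {
			fmt.Println("mismatch at", n)
			return
		}
	}
	fmt.Println("i32Prime agrees with ProbablyPrime below 100000")
}
```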

Saturday, February 24, 2018

Let's get going!

You might be asking if this is just one more of the many blog posts about go that can be found all over the internet.  I don't want to duplicate what other people have written, so I'll mostly be writing about crypto functions like sha3/keccak in go.

Despite a brief experiment with go almost two years ago, I had not done any serious coding in go.  That all changed when early this year I decided to write an ethereum miner from scratch.  After maintaining and improving https://github.com/nerdralph/ethminer-nr, I decided I would like to try something other than C++.  My first attempt was with D, and while it fixes some of the things I dislike about C++, 3rd-party library support is minimal.  After working with it for about a week, I decided to move on.  After some prototyping with python/cython, I settled on go.

After eight years of development, go is quite mature.  As I'll explain later in this blog post, my concerns about code performance proved to be unwarranted.  Still, go is new enough that there is room for improvement in its libraries.

Since I'm writing an ethereum miner, I need code that can perform keccak hashing.  Keccak is the same as the official sha-3 standard with a different pad (aka domain separation) byte.  The crypto/sha3 package internally supports the ability to use arbitrary domain separation bytes, but the functionality is not exported.  Therefore I forked the repository and added functions for keccak-256 and keccak-512.  A common operation in crypto is XOR, and the sha3 package includes an optimized XOR implementation.  This function is not exported either, so I added a fast XOR function as well.
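To illustrate the sort of optimization the package keeps internal, here is a simplified word-at-a-time XOR.  This is my own sketch, not the sha3 package's actual code, which also handles alignment and architecture specifics:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// xorBytes XORs b into a, 8 bytes at a time where possible, falling
// back to byte-at-a-time for the tail.  Assumes len(b) >= len(a).
func xorBytes(a, b []byte) {
	i := 0
	for ; i+8 <= len(a); i += 8 {
		v := binary.LittleEndian.Uint64(a[i:]) ^ binary.LittleEndian.Uint64(b[i:])
		binary.LittleEndian.PutUint64(a[i:], v)
	}
	for ; i < len(a); i++ {
		a[i] ^= b[i]
	}
}

func main() {
	msg := []byte("attack at dawn")
	key := []byte("0123456789abcd")
	buf := append([]byte(nil), msg...)
	xorBytes(buf, key) // mask
	xorBytes(buf, key) // unmask: XOR is its own inverse
	fmt.Println(bytes.Equal(buf, msg)) // true
}
```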

Ethereum's proof-of-work uses a DAG of about 2GB that is generated from a 32MB cache.  This cache and the DAG change and grow slightly every 30,000 blocks (about 5 days).  Using my modified sha3 library, and based on the description from the ethereum wiki, I wrote a test program that connects to a mining pool, gets the current seed hash, and generates the DAG cache.  The final hex string printed out is the last 32 bytes of the cache.  I created an internal debug build of ethminer-nr that also outputs the last 32 bytes of the cache in order to verify that my code works correctly.

When it comes to performance, I had read some old benchmarks showing gcc-go generating much faster code than the stock go compiler (gc).  Things have obviously changed, as the standard go compiler was much faster in my tests.  My ETH cache generation test program takes about 3 seconds to run when using the standard go compiler versus 8 seconds with gcc-go using -O3 -march=native.  This is on an Intel G1840, comparing go version go1.9.2 linux/amd64 with go1.6.1 gccgo.  The versions chosen were the latest pre-packaged versions for Ubuntu 16 (golang-1.9 and gccgo-6).  At least for compute-heavy crypto functions, I don't see any point in using gcc-go.

Sunday, February 4, 2018

Ethereum mining pool comparisons

Since I started mining ethereum, the focus of my optimizations has been on mining software and hardware tuning.  While overclocking and software mining tweaks are the major factor in maximizing earnings, choosing the best mining pool can make a measurable difference as well.

I tested the top three pools with North American servers: Ethermine, Mining Pool Hub, and Nanopool.  I tested mining on each pool, and wrote a small program to monitor pools.  Nanopool came out on the bottom, with Ethermine and Mining Pool Hub both performing well.

I think the biggest difference between pool earnings has to do with latency.  For someone in North America, using a pool in Asia with a network round-trip latency of 200-300ms will result in lower earnings than a North American pool with a network latency of 30-50ms.  The reason is higher latency causes a higher stale share rate.  If it takes 150ms for a share submission to reach the pool, with Ethereum's average block time of 15 seconds, the latency will add 1% to your stale share rate.  How badly that affects your earnings depends on how the pool rewards stale shares, something that is unfortunately not clearly documented on any of the three pools.
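The 1% figure is just the latency expressed as a fraction of the average block time; as a quick sketch:

```go
package main

import "fmt"

// addedStalePct estimates the extra stale-share rate caused by
// submission latency: a share found within `latency` of a block
// change arrives too late, so on average the latency's fraction of
// the block time becomes additional stales.
func addedStalePct(latencyMs, blockTimeSec float64) float64 {
	return latencyMs / (blockTimeSec * 1000) * 100
}

func main() {
	// 150ms of latency against Ethereum's 15-second block time.
	fmt.Printf("%.1f%% extra stales\n", addedStalePct(150, 15))
}
```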

When I first started mining I would do simple latency tests using ping.  Following Ethermine's recent migration of their servers to AWS, they no longer respond to ping.  What really matters is not ping response time anyway, but how quickly the pool forwards new jobs and processes submitted shares.  What further complicates an evaluation of different pools is that they often have multiple servers behind one host name.  For example, here are the IP addresses for us-east1.ethereum.miningpoolhub.com from dig:
us-east1.ethereum.miningpoolhub.com. 32 IN A
us-east1.ethereum.miningpoolhub.com. 32 IN A
us-east1.ethereum.miningpoolhub.com. 32 IN A
us-east1.ethereum.miningpoolhub.com. 32 IN A

Even though has a ping time about 40ms lower than, the server usually sent new jobs faster than  The difference between the first and last server to send a job was usually 200-300ms.  With nanopool, the difference was much more significant, with the slowest server often sending a new job 2 seconds (2000ms) after the fastest.  Recent updates posted on nanopool's site suggest their servers have been overloaded, such as changing their static difficulty from 5 billion to 10 billion.  Even with miners submitting shares at half the rate, it seems they are still having issues with server loads.

Less than a week ago, us1.ethermine.org resolved to a few different IPs, and now it resolves to a single AWS IP:  I suspect there are at least two different servers using load balancing to respond to requests for the single IP.  By making multiple simultaneous stratum requests and timing the new jobs received, I was able to measure variations of more than 100ms between some jobs.  That seems to confirm my conclusion that there are likely multiple servers with slight variations in their performance.

In order to determine if the timing performance of the pools was actually having an impact on pool earnings, I looked at stats for blocks and uncles mined from etherscan.io.  Those stats show that although Nanopool produces about half as many blocks as Ethermine, it produces more uncles.  Since uncles receive a reward of at most 2.625 ETH vs 3 ETH for a regular block, miners should receive higher payouts on Ethermine than on Nanopool.  Based solely on uncle rate, payouts on Ethermine should be slightly higher than on MPH.  Eun, the operator of MPH, has been accessible and responsive to questions and suggestions about the pool, while the Ethermine pool operator is not.  As an example of that accessibility, three days ago I emailed MPH about 100% rejects from one of their pool servers.  Thirty-five minutes later I received a response asking me to verify that the issue was resolved after they rebooted the server.

In conclusion, either Ethermine or MPH would make reasonable choices for someone mining in North America.  This pool comparison has also opened my eyes to optimization opportunities in mining software in how pools are chosen.  Until now mining software has done little more than switch pools when a connection is lost or no new jobs are received for a long period of time.  My intention is to have my mining software dynamically switch to mining jobs from the most responsive server instead of requiring an outright failure.

Thursday, December 14, 2017

Mining with AMDGPU-PRO 17.40 on Linux

A 17.40 beta was released on October 16, with a final release following on October 30th.  There have been some issues with corrupt versions of the final release, but I think they are resolved now.  I encountered lots of problems with this release, which was much of the motivation for making this post.

Until earlier this year, the AMDGPU-PRO drivers were targeted at the new Polaris cards, and support for even the relatively recent Tonga was lacking.  Because of this, I was using the fglrx drivers for Tonga and Pitcairn cards.  The primary reason for upgrading now is for large page support, which improves performance on algorithms that use a large amount (2GB or more) of memory.  With the promise of better performance, and since fglrx is no longer being maintained, I decided to upgrade.

I've been using AMDGPU-PRO with kernel 4.10.5 for my Rx 470 cards, so I decided to use the same kernel.  I can't say there are any problems with using a newer kernel like 4.10.17 or even 4.14.5, so they might work just as well.  I left the on-board video enabled (i915) so I would not have to keep connecting and disconnecting video cables when testing the GPUs.  After installing Ubuntu 16.04.3, I updated the kernel and rebooted.  For installing the AMDGPU-PRO drivers, I used the px option (amdgpu-pro-install --px), as it is supposed to support mixed iGPU/dGPU use.

My normal procedure for bringing up a multi-GPU machine is to start with a single GPU in the 16x motherboard slot, as this avoids potential issues with flaky risers.  Even with just one R9 380 card in the 16x slot, I was having problems with powerplay.  When it is working, pp_dpm_sclk will show the current clock rate with an asterisk, but this was not happening.  After two days of troubleshooting, I concluded there is a bug with powerplay and some motherboards when using the 16x slot.  When using only the 1x slots, powerplay works fine.

Since I wasn't able to use the 16x motherboard slot, testing card and riser combinations was more difficult.  Normally when I have a problem with a card and riser, I'll move the card to the 16x slot.  If the problems go away, I'll mark the riser as likely defective.  Mining algorithms like ethash use little bandwidth between the CPU and GPU, so there is no performance loss in using 1x risers.  Even the slowest PCIe 1.1 transfer rate is sufficient for mining.  Using "lspci -vv", I could see the link speed was 5.0GT/s (LnkSta:), which is PCIe gen2 speed.  Reducing the speed to gen1 would mean lower quality risers could be used without encountering errors.

My first thought was to try to set the PCIe speed in the motherboard BIOS.  Setting gen1 in the chipset options made no difference, so perhaps it is only the speed used during boot-up before the OS takes over control of the PCIe bus.  Next, using "modinfo amdgpu", I noticed some module options related to PCIe.  Adding "amdgpu.pcie_gen2=0" had no effect.  Apparently the module no longer supports that option.  I could not find any documentation for the "pcie_gen_cap", but luckily the open-source amdgpu module supports the same module parameter.  By looking at amd_pcie.h in the kernel source code, I determined "0x10001" will limit the link to gen1.  I added "pcie_gen_cap=0x10001" to /etc/default/grub, ran update-grub, and rebooted.  With lspci I was able to see that all the GPUs were running at 2.5GT/s.
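For reference, the change ends up looking roughly like this (your existing GRUB_CMDLINE options will differ; I've shown the parameter with the amdgpu. module prefix, which is how module parameters are normally passed on the kernel command line):

```shell
# /etc/default/grub -- append the parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet amdgpu.pcie_gen_cap=0x10001"

# regenerate the grub config and reboot
sudo update-grub
sudo reboot
```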

For clock control and monitoring, I've previously written about ROC-smi.
====================    ROCm System Management Interface    ====================
 GPU  DID    Temp     AvgPwr   SCLK     MCLK     Fan      Perf    OverDrive  ECC
  3   6938   66.0c    100.172W 858Mhz   1550Mhz  44.71%   manual    0%       N/A
  1   6939   64.0c    112.21W  846Mhz   1550Mhz  42.75%   manual    0%       N/A
  4   6939   62.0c    118.135W 839Mhz   1500Mhz  47.84%   manual    0%       N/A
  2   6939   77.0c    123.78W  839Mhz   1550Mhz  64.71%   manual    0%       N/A
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
GPU[0]          : PowerPlay not enabled - Cannot get supported clocks
  0   0402   N/A      N/A      N/A      N/A      None%              N/A      N/A
====================           End of ROCm SMI Log          ====================

I also use Kristy's utility to set specific clock rates:
ohgodatool -i 1 --mem-state 3 --mem-clock 1550

Unfortunately ethminer-nr doesn't work with this setup.  I suspect the new driver doesn't support some old OpenCL option, so the fix should be relatively simple, once I make the time to debug it.