
MMW Data Outage?


    • #14311
      Robert S
      Participant

        Is there a problem with the data age not updating on the MMW site or an outage of cellular coverage?

        I am seeing a lot of sites that haven’t been updated in the last 7 hours (at the time of this posting).

         

        Robert

      • #14313
        Cal
        Participant

          I know Hologram is having trouble. None of my devices could connect for the past 7 hours.

        • #14314
          Robert S
          Participant

            Thanks.

            I see some sites came back online at 7:15 UTC, but some are still showing as offline.

          • #14315
            Cal
            Participant

              All my devices came back online after 3:30 am ET. The Hologram outage lasted around 15 hours. I’m going to look at alternate providers if this happens again.

            • #14318
              Heather Brooks
              Keymaster

                Hi folks, chiming in here to let you know that we now have a dedicated Monitor My Watershed forum! You can find it in the dropdown options under Forums in the main menu. I’ve moved this topic over; it should not affect the existing conversation in any way but should make it more discoverable when members are looking for help with MonitorMW.

              • #14319
                Jim Moore
                Participant

                  I reposted this here from my post this morning on the “infrastructure and equipment” forum:

                  “I noticed what appears to be a general shutdown of the 2G network in the N. Chester County area, which occurred Sunday, 7/12, at 13:45 EDT.  I have 8 stations in the Great Marsh Institute’s network.  Three of these woke up this morning but the others are still mute.

                  According to Hologram it looks like 2G is on borrowed time

                  Does anyone have any details on the timeline for remaining 2G support?



                  @shicks

                  Does Stroud have any large-scale upgrade plans?  I understand that 4G requires new modems, and if so, are the 2G modems I just purchased a few months ago now scrap!?  Maybe we could put in a group order for the 4G modems to at least get a quantity discount, and sell the 2G modems on eBay if they have any intrinsic value.”

                   

                • #14321
                  Shannon Hicks
                  Moderator

                    There was a major worldwide outage with Hologram yesterday that is still being resolved. You can read more about the details here on their status page: https://status.hologram.io/

                    So it’s not a MonitorMyWatershed outage. It has affected only EnviroDIY stations with 2G modems; none of our stations with 4G modems were affected. T-Mobile is the only provider for 2G right now, and they are planning to decommission their 2G network at the end of 2020, so we have been making plans to upgrade all of our existing stations from 2G to 4G by the end of the year. This 2G outage gives us a glimpse of what it would look like in January when all the 2G stations will drop offline. Upgrading from 2G to 4G involves visiting each station, reprogramming the Mayfly, replacing the modem and antenna, and adding an LTEbee adapter. It requires about $100 worth of hardware and a visit to each of our roughly 40 stations, so it’s a time-consuming process, but it will be necessary to keep everything online.

                    So any stations offline right now are likely using 2G hardware, and Hologram says things should be returning to normal soon. The owners of these stations should start to consider upgrading their hardware before the end of the year in order to avoid the permanent 2G outage that is inevitable. And if your 2G station doesn’t come back online in the next few days after everything returns to normal, you might visit it and check the battery. Sometimes cell outages stress the battery by causing excessive connection times every 5 minutes, which can drain the battery faster, especially in areas that are shaded by the full leaf canopy this time of year.

                  • #14322
                    Jim Moore
                    Participant

                      Thanks for the update, Shannon.  Are there any plans to buy a quantity of 4G hardware in the expectation of a discount?  I will need at least 8 and would be glad to help out where needed.

                      • #14326
                        Shannon Hicks
                        Moderator

                          This is an equipment question and not related to MMW, so I’ll answer that in your other forum thread.

                      • #14323
                        Jim Moore
                        Participant

                          @heather

                          Since it’s not a MMW issue, should I move my technical questions back to my original post on the “infrastructure and equipment” forum?

                        • #14324
                          Robert S
                          Participant

                            @heather

                            My bad for starting this post in the wrong category. I should have paid more attention!


                            @shicks

                            Forgive my ignorance about the technical details of cellular service but …
                            As for “any stations currently offline right now are likely using 2G hardware…”, SL168 (Punches Run) was upgraded to 4G in 2019 but was still down. If the logger cannot connect to 4G, will it try to connect in other ways? I’ll have to read up a little more to understand it.

                             

                            Robert

                            • #14325
                              Shannon Hicks
                              Moderator

                                The station at Punches Run has been back online for several hours now. It has a 4G board on it. The cellular hardware is either 4G only or 2G only. The connectivity issues are occurring somewhere in Hologram’s system behind the scenes and I don’t know any of the details, other than that all of our 2G stations are still offline, and some of our 4G stations were offline yesterday, but most (if not all) are back online, and some were not affected at all. There have been several outage problems like this in the past 5 years that we’ve been deploying cellular-equipped loggers, and we usually just have to be patient and wait for the carriers and service providers to fix their issues. That’s why the Mayfly loggers have redundant on-board memory cards for storing sensor data. So no data has been lost, and owners will just have to visit their stations to retrieve the memory cards to fill in the data gaps from the periods of missing cellular data.

                            • #14328
                              Robert S
                              Participant

                                @shicks

                                Okay. I misunderstood.

                                I thought you were saying that ONLY 2G stations were affected. I did see that the Punches Run station was back up this morning.

                                Thanks for the clarification.

                                 

                                Robert

                              • #14332
                                Shannon Hicks
                                Moderator

                                  The Hologram network was back up and running early this morning, and there have been no further issues with their network today. All of the stations that lost connectivity on Sunday are functioning normally.

                                • #14333
                                  Anthony Aufdenkampe
                                  Participant

                                    Thanks Robert, Cal, Jim and others for this thread, and to @shicks for providing all those updates on the Hologram situation.

                                    I would like to add that our team overseeing the Monitor My Watershed portal has also noticed some issues on our end, which are separate from the Hologram issues, and we are working on improving MMW services.

                                    Many of these issues have been apparent since the COVID-19 work-at-home orders were put in place. Since then, we’ve noticed intermittent 502 and 504 gateway error responses from our server (either when browsing via the web or when posting data from a monitoring device), and we’ve noticed a general slowness when browsing the portal.

                                    One of the issues is that since COVID-19, the web servers at LimnoTech that host MMW are getting a lot more internet traffic as LimnoTech staff use our VPN to work from home and for other reasons.  To that end, LimnoTech has done a number of upgrades to its network, which resulted in several planned 1-2 hour outages. Although we’ve seen some improvements, this hasn’t solved all the issues. We’re continuing to work on optimizing our network.

                                    The other issue is that Monitor My Watershed is running on an aging software stack, which could probably benefit from a not-so-trivial round of updates.

                                    The Stroud Water Research Center and LimnoTech are committed to long-term maintenance and development of the Monitor My Watershed data sharing portal. We have developed a roadmap for the next phase of development, and are presently exploring funding options to get started. If you have any leads for potential funding, please contact us! Every drop counts!

                                    Our development roadmap includes hosting MMW on Amazon Web Services (AWS) for enterprise-class up-time. It also includes addressing various “tech-debt” items that naturally accrue as a software system ages, along with many other items that we’ve listed in our Release 0.12 – Tech Debt / Refactor Code milestone on GitHub.

                                  • #14334
                                    Matt Barney
                                    Participant

                                      Thanks, Anthony, for this informative post! It speaks to some questions we at Trout Unlimited have had about MMW, its current performance, and future direction, as we look to expand our Mayfly deployments.

                                      Matt

                                      (cc @jlemontu-org)

                                    • #14335
                                      Anthony Aufdenkampe
                                      Participant

                                        Matt, I’m glad you found the update helpful! I would be interested in connecting with you to hear more of your perspectives and long-term needs, if that’s of interest to you.

                                      • #14376
                                        Anthony Aufdenkampe
                                        Participant

                                          Hey All, we found an issue that cropped up in late June on the database server for Monitor My Watershed. We’ll be doing a planned maintenance shutdown today at around 12:30 ET to fix the issue, and it will likely take about an hour. We’ll let you know when it is back up and running.

                                        • #14453
                                          Anthony Aufdenkampe
                                          Participant

                                            We think we found and fixed the major issue with slowness and 502/504 Gateway errors that have been increasing since this spring!

                                            Please let us know if you see any slowdowns from the web-browser or gateway errors from devices trying to post data, just in case there’s yet another issue we didn’t notice.

                                            @tslawecki , @htaolimno and LimnoTech’s IT team have done numerous fixes, tweaks and optimizations to our servers, virtual machines, network, routers, and firewall over the last 2-3 weeks. For example, we had some internal DNS issues and some internal API calls were going out to the global internet rather than staying inside our firewall, slowing things down. There were other similar issues. A lot of those tweaks appeared to make differences, but the outage this weekend showed us that we hadn’t found the root solution.

                                            We now believe that a major factor in the outages was that the database holding the measured values grew to be bigger than the RAM we allocated for the database virtual machine, which caused a major performance slowdown when combined with the other issues. We increased the database server virtual machine RAM to 60 GB (our max available), which gives us about a year of breathing room given the current database size of 43 GB (177 million data points!) and our current growth rate of almost 8 million data points per month.

                                            The long-term solution is to optimize the software stack so that we’re not so constrained by server RAM, which we know is possible. We would want to do this in combination with work on the first 4-7 issues in our Release 0.12 – Tech Debt / Refactor Code milestone, and we are actively looking for funding to do this work, as I mentioned above.
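The headroom estimate in the post above can be roughly checked from the quoted figures. This is only a sketch: it assumes the database grows linearly and that the average storage per data point stays constant, neither of which is stated in the post, and the function name is made up for illustration.

```python
def months_of_headroom(ram_gb, db_size_gb, points, points_per_month):
    """Months until the database outgrows the VM's RAM.

    Assumes (hypothetically) constant bytes per data point and linear growth.
    """
    gb_per_point = db_size_gb / points
    growth_gb_per_month = gb_per_point * points_per_month
    return (ram_gb - db_size_gb) / growth_gb_per_month


# Figures quoted in the post: 60 GB RAM, 43 GB database,
# 177 million data points, almost 8 million new points per month.
months = months_of_headroom(60.0, 43.0, 177e6, 8e6)
print(f"Estimated headroom: {months:.1f} months")
```

Under these simplifying assumptions the result comes out a little under nine months, which is in the same ballpark as the "about a year" quoted above.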

                                          • #14458
                                            Matt Barney
                                            Participant

                                              Great news, @aufdenkampe! Thanks to you and your team.

                                              I ran a test overnight, with a Mayfly sampling every 5 minutes. Out of 179 samples sent to MMW, only one received Response Code 504, at 04:30 MST, Aug 7th. All other POST messages received successful response codes (201). The 04:30 point did not get saved to the MMW database. What I’ve observed in the past was that the ‘504’ points still got saved in the database. In any case, this appears to be an improvement compared to my previous tracking of 504 errors.

                                              There were 4 other sample points during my test which the Mayfly saved to the SD card but apparently never attempted to send to MMW, as there were no “Sending data” nor “POST” messages in the log at those times. I believe this is a Mayfly/Xbee3 issue, not a MMW issue. I’ve only seen it when using an LTE modem, not when using WiFi.

                                              Best,

                                              Matt

                                              Trout Unlimited
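For anyone wanting to track response codes the same way, a tally like the one described in the post above could be sketched as follows. The list of codes is reconstructed from the numbers in the post (179 POSTs, one 504, the rest 201), not taken from an actual log, and `summarize` is a hypothetical helper name.

```python
from collections import Counter


def summarize(codes):
    """Return (count per HTTP status code, fraction of POSTs answered with 201)."""
    counts = Counter(codes)
    total = sum(counts.values())
    success_rate = counts.get(201, 0) / total if total else 0.0
    return counts, success_rate


# Reconstructed from the post: 179 samples, exactly one 504.
codes = [201] * 178 + [504]
counts, rate = summarize(codes)
print(dict(counts), f"{rate:.2%}")
```

Note that this counts only what the logger saw; as mentioned above, a 504 on the device side does not always tell you whether the point was saved in the database.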

                                            • #14462
                                              neilh20
                                              Participant

                                                Hey good to hear.

                                                I haven’t been able to do a lot of testing, and I was out yesterday, but I set up a laptop to monitor one beta system overnight, starting at 9pm PST; that system uses Verizon. (I forgot to connect the laptop’s power cord, though, and it turned off after 2 hours!) It’s sampling every 15 minutes, taking 8 readings, and pushing the 8 updates every 2 hours at an offset of 7 minutes, that is, at 23:07, 01:07, 03:07. The POST timeout is tighter, at 5 seconds; if it doesn’t get a response, it records it as a 504. I’ve created a POSTLOG.TXT on the uSD that records all POST attempts. If it doesn’t get a 201, it queues the readings and then retries on the next successful 201.

                                                Looking at the POSTLOG.txt this morning, it has mostly got 504’s, with a few 201s.

                                                The debug log I got from a POST of 8 readings at 23:07 PST (2020-08-07T07:07:00-08:00 ) shows they were all 504s.

                                                Downloading the .csv file from MMW this morning and looking at the records, a good number of readings (24) didn’t make it to the database, but those that did, all made it.

                                                I’ll set up some more testing later today.

                                                https://github.com/ODM2/ODM2DataSharingPortal/issues/483

                                                https://github.com/EnviroDIY/ModularSensors/issues/194

                                                 

                                              • #14463
                                                neilh20
                                                Participant

                                                  Over the last couple of days I’ve been getting very good response when using WiFi, and the response time for the 201 ack is under 1 second.

                                                  This is a fast check with a 2-minute sampling time and SendX=2, so delivery every 4 minutes. I get a response typically in under 0.5 seconds, and occasionally in about 0.6 seconds. So all 1250 messages have been delivered, either the first time or on subsequent retries.

                                                  For the beta Verizon system, with sampling at 15 minutes and SendX=8 (a wireless connection every 2 hours) and a timeout of 5 seconds, there are bursts of successful delivery with ack 201. The ack time is sometimes about 1.4 seconds, but mostly, when successful, about 4.5 seconds. So I’m guessing this has something to do with Verizon’s network. I’m going to have to change the timeout back to 10 seconds for better characterization.

                                                  Thanks to the MMW team for finding the issues and getting it responding!!!
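The relationship between sampling interval, SendX, and timeout described in the post above can be sketched like this. It is an illustration only, not the logger firmware: the function names are made up, and the 7.0-second ack is a hypothetical slow response added to show the effect of the timeout.

```python
def delivery_interval_minutes(sample_minutes, send_x):
    """Minutes between uploads when readings are batched SendX at a time."""
    return sample_minutes * send_x


def recorded_status(ack_seconds, timeout_seconds):
    """Status the logger would record for one POST.

    Per the posts above, a response that does not arrive within the
    timeout is logged locally as a 504; otherwise this sketch assumes a 201.
    """
    return 201 if ack_seconds <= timeout_seconds else 504


print(delivery_interval_minutes(2, 2))    # WiFi fast check
print(delivery_interval_minutes(15, 8))   # beta Verizon system

# Ack times of 1.4 s and 4.5 s are from the post; 7.0 s is hypothetical.
for timeout in (5, 10):
    print(timeout, [recorded_status(t, timeout) for t in (1.4, 4.5, 7.0)])
```

With acks clustering near 4.5 seconds, a 5-second timeout leaves very little margin, which is one plausible reason to go back to 10 seconds for characterization.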

                                                • #14464
                                                  Matt Barney
                                                  Participant

                                                    I repeated my test for another ~48 hour run, sampling every 5 minutes, but this time using WiFi instead of XBee3 cellular. All of my 548 sample points made it to the MMW database, and all POST messages sent by the Mayfly received successful response code 201. So data upload via cellular appears to be significantly less reliable, even when cell signal is good.

                                                  • #14465
                                                    Anthony Aufdenkampe
                                                    Participant

                                                      Neil & Matt, thanks for that very good news from your testing!

                                                      I’m really glad to hear that everything seems to be working well again!!!

                                                    • #14466
                                                      neilh20
                                                      Participant

                                                        The response is great.

                                                        For my WiFi/Xbee S6 setup with accelerated updates (2-minute sampling, update every 4 minutes), the ACK time over 700 POSTs is between 200 ms and 774 ms, with all POSTs succeeding on the first attempt.

                                                        For my Verizon/Xbee LTE at 15-minute sampling, the ACK time is typically 5 seconds, very occasionally about 1.5 seconds, and sometimes 7 seconds. For this test it delivered the outstanding readings that weren’t delivered previously, and then the new readings.

                                                        I am working on a new feature, Reliable Delivery, as cellular wireless range can vary and be unreliable. Often, though, there are periods of greater reliability (wind in the right direction). So if the first POST attempt doesn’t succeed, the reading is serialized to a QUExx.txt file to be retried when there is a connection.
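The queue-and-retry idea described above can be sketched in a few lines. This is not the actual Reliable Delivery code: it keeps the queue in memory rather than in a QUExx.txt file, retries oldest-first on every new submission, and the `DeliveryQueue` class and `send` callable are hypothetical names used only for illustration.

```python
from collections import deque


class DeliveryQueue:
    """Hold readings until the server acknowledges them with a 201."""

    def __init__(self, send):
        # send: callable taking one reading and returning an HTTP status code.
        self.send = send
        self.pending = deque()

    def submit(self, reading):
        """Queue a new reading, then try to deliver everything outstanding."""
        self.pending.append(reading)
        return self.flush()

    def flush(self):
        """Send oldest-first; stop at the first failure. Returns count delivered."""
        delivered = 0
        while self.pending:
            status = self.send(self.pending[0])
            if status != 201:
                break  # keep this reading and the rest for the next attempt
            self.pending.popleft()
            delivered += 1
        return delivered


# Simulated link: the first POST times out (504), later ones succeed.
statuses = iter([504, 201, 201])
queue = DeliveryQueue(lambda reading: next(statuses))
print(queue.submit("reading-1"), list(queue.pending))
print(queue.submit("reading-2"), list(queue.pending))
```

Sending oldest-first keeps the readings in order, matching the behavior described above where outstanding readings are delivered before new ones.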
