pdc:progress
Date / Log
Dec 12: Professor Carl and I tried booting up the older Raspberry Pis with no luck. We thought the HDMI cord could be the reason, but after testing with the new HDMI cord from Switch I got the same result as before, so I believe something else is wrong with them. Maybe I missed something. I also read through some articles from the 2016 summer research.
Dec 13: I tried the TV in the common room of the Mississippi townhouse as a monitor, and neither of the older RPis showed any output. The lab was locked, so I couldn't get in and use the computer monitors. Another reading day.
Dec 14: Professor Carl brought one older RPi with him and that one worked well. I tested the other two older RPis I had and found one of them broken. The original HDMI cord worked. All SD cards worked well, and I replaced the old IP addresses with new ones. However, another RPi also seemed broken when I tested more SD cards. I will try that one again tomorrow and hopefully it gets back to normal.
Dec 15: The newly defective older RPi stayed broken, so now only one older RPi still works. The RPi 3 that Professor Carl gave me only worked with the Worker4 SD card; it couldn't read the Master and Worker1 SD cards for some reason. Therefore, following the instructions, I used the older RPi with the Master SD card and the RPi 3 with the Worker4 SD card and connected them through Ethernet. I tested them and they both worked very well at first, but the Master node suddenly no longer had the "mpiexec" file and we couldn't figure out why. I will try to fix it tomorrow.
Dec 16: Professor Carl found that the cause of yesterday's problem was that the correct path went through the "mpich3" directory, not "mpich2". We managed to connect the two RPis and noted that when testing the connection with the command mpiexec -f machinefile -n 2 ~/mpich2-build/examples/cpi, it must be run on the master node, not the worker node; also, the file is named "machine_file", not "machinefile". The updated instructions didn't specify these things. I also checked all 8 SD cards I had and filled out the SD Card Chart. We wanted to identify the cards that could work on the RPi 3 for tomorrow's work.
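For reference, a minimal sketch of the machine file and test command on our setup; the IP addresses below are placeholders, not the ones we actually used, and the cpi path should point at whichever build directory actually contains cpi:

  # machine_file on the master node: one node per line (placeholder addresses)
  192.168.1.101
  192.168.1.102

  # run from the master node; note the file name is machine_file, not machinefile
  mpiexec -f machine_file -n 2 ~/mpich2-build/examples/cpi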
Dec 17: We decided to use Card 3.1 as the master card. I put another RPi 3 in the Pi tower and tried to set up the master node following the tutorial. However, I ran into a problem: I couldn't use the web browser to search for anything even though the master node was connected to the Ethernet. I checked the Ethernet and the cord and they were both working, so maybe something was wrong with the Pi itself. This halted my progress on downloading MPICH. I looked up the problem on my laptop and tried some of the suggestions, but nothing helped; I haven't yet tried one option: reinstalling Raspbian.
Dec 18: We tried Card 3.2 today and it could connect to the internet, which let me continue setting up the master node. MPICH 1.5 is outdated and can't be found anymore, so I installed MPICH 3.3. I got stuck at the "configuring process" step because I was using a different guide, but it turned out this one is fairly accurate. I just used MPICH 3.3 and changed "-disable-fc" to "--disable-fortran", since I couldn't download Fortran on Card 3.2 and the flag she gave in the guide wasn't working. After an hour of "make", I finally installed MPICH and got to the testing step. Sadly, mpiexec could not be found again. I will find the reason tomorrow.
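A rough sketch of the MPICH 3.3 build steps as I remember them; the download URL and the install prefix are assumptions based on the standard MPICH layout and the mpich-install/mpich-build directories mentioned later:

  # download and unpack MPICH 3.3
  wget http://www.mpich.org/static/downloads/3.3/mpich-3.3.tar.gz
  tar xzf mpich-3.3.tar.gz

  # configure without Fortran, since Fortran could not be installed on Card 3.2
  mkdir mpich-build && cd mpich-build
  ../mpich-3.3/configure --prefix=/home/pi/mpich-install --disable-fortran

  # build and install (the make step took about an hour on the Pi)
  make
  sudo make install

  # make the binaries visible to the shell, otherwise mpiexec is "not found"
  export PATH=/home/pi/mpich-install/bin:$PATH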
Dec 20: I tried to solve the problem from Friday but it didn't go well; I found that permission to access the cpi file is denied even though the access controls showed no problem. After a while, I moved on to connecting the master node to the two worker nodes and hit a bug: ping: icmp open socket: Operation not permitted, which stopped me from detecting the IP addresses; it could be solved by entering sudo chmod u+s /bin/ping. I did connect the master node to worker002, but worker003 couldn't identify the host. I will look into this tomorrow.
Dec 21: I started with the problem of being unable to connect the master node to worker003. Professor Carl and I found a peculiar solution online: change the hostname inside /etc/hosts. You have to sudo nano into the file and change the hostname on the last line from "raspberrypi" to "worker003" so the machine can recognize and match the hostname. The next problem was with "mpiexec" on the master node: I couldn't execute the test command because mpiexec didn't have permission to write the cpi file, which was strange since this had never happened before. Professor Carl tried running it with "sudo" through the "mpich-install" directory and it worked. I then gave write permission to the user using sudo chmod u+w directory, and "mpiexec" also worked through "mpich-build". I finished mounting the hard drive on the master node, but ran into a little trouble when I tried to connect the HD to the other two worker nodes: when I typed /mnt/nfs 192.168.1.124(rw,sync,no_subtree_check), the terminal replied with bash: syntax error near unexpected token `('. Still looking for a solution.
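In hindsight, the line that triggered the bash syntax error is an NFS export entry, which belongs in the /etc/exports file on the master node rather than at the shell prompt (the Dec 25 and Jan 4 entries bear this out). A minimal sketch, using the worker address quoted above:

  # /etc/exports on the master node: share /mnt/nfs with a worker's IP address
  /mnt/nfs 192.168.1.124(rw,sync,no_subtree_check)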
Dec 25: I fixed the problem from Dec 21 with Professor Carl's help: when modifying the /etc/exports file, you need to run all of the commands on the master node so the worker nodes can access the files on the master node. I also hit another problem: when I tried to mount the HD on a worker node following the instructions, the terminal showed mount.nfs: Connection timed out and the HD light started blinking, which didn't happen when I correctly mounted it on the master node. I looked online and tried some solutions, but they all returned the same result: connection timed out.
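A sketch of the master-side steps and the worker-side mount, assuming the usual nfs-kernel-server setup on Raspbian; the master address here is a placeholder:

  # on the master node: edit the export list, then re-export and restart the NFS server
  sudo nano /etc/exports
  sudo exportfs -ra
  sudo systemctl restart nfs-kernel-server

  # on a worker node: mount the shared directory from the master (placeholder master IP)
  sudo mkdir -p /mnt/nfs
  sudo mount 192.168.1.100:/mnt/nfs /mnt/nfs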
Jan 4: The problem from last time was solved. First, the sda1 partition of the hard drive was pretty small and had nearly nothing on it, so I mounted sda6 on the master and worker nodes instead. Second, when I was modifying the exports file, I had left the /mnt/nfs worker-IP-address(rw,sync,no_subtree_check) line as a comment, but it has to be without the "#" (not a comment). Now the master node and the two worker nodes are successfully mounted, and I think the light on the hard drive only started blinking when it was no longer in use. To avoid entering the mount commands every time the system is rebooted, I put the commands in the .bashrc file so the HD mounts automatically. However, we still don't know why we need the USB hub if we can just mount the hard drive on the nodes through the router.
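Roughly what the .bashrc lines look like; the device and address come from this entry (sda6 on the master, the master's address on the workers), and the master IP is a placeholder. An /etc/fstab entry would be the more conventional place to mount at boot, but .bashrc was good enough here:

  # master node ~/.bashrc: mount the hard drive's sda6 partition if it isn't mounted yet
  mountpoint -q /mnt/nfs || sudo mount /dev/sda6 /mnt/nfs

  # worker node ~/.bashrc: mount the master's exported /mnt/nfs (placeholder master IP)
  mountpoint -q /mnt/nfs || sudo mount 192.168.1.100:/mnt/nfs /mnt/nfs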
Jan 5: I learned how to write a shell script and tried to use a single command to mount the HD instead of typing several commands in the terminal. Putting the commands in the .bashrc file is still easier, but it's always good to learn something new.
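The script was nothing fancy; a sketch of what a single-command mount_hd.sh could look like (the name and paths are illustrative, not the exact script I wrote):

  #!/bin/bash
  # mount_hd.sh: mount the shared hard drive with one command (illustrative sketch)
  if mountpoint -q /mnt/nfs; then
      echo "/mnt/nfs is already mounted"
  else
      sudo mount /dev/sda6 /mnt/nfs && echo "mounted /dev/sda6 on /mnt/nfs"
  fi

Run it with bash mount_hd.sh, or make it executable with chmod +x mount_hd.sh and call ./mount_hd.sh.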
Jan 6: Dr. Carl decided to try something new, and I was really excited to learn that I could contribute to a pandemic simulation. I am a slow reader, so I spent most of my time going through the materials.
Jan 7: A reading day. The tutorial went through every piece of code with detailed explanations, and I had to remember what every variable represented.
Jan 8: Another reading day. I finished the reading and am now ready to move on to setting up and running the program. I tried to compile the C program on the master node, but the X11 library was not installed on the RPi OS, so I will look for solutions tomorrow following the instructions.
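For the record, the fix on Raspbian is just installing the X11 development headers; the package name below is the standard Debian one, though I'm going from memory here:

  # install the X11 client library headers needed to compile the display code
  sudo apt-get update
  sudo apt-get install libx11-dev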
Jan 9: I tried to build and run the parallel version of the pandemic simulation. However, when I entered mpirun -machinefile machines -np 6 ./Pandemic-mpi, the terminal gave me a bunch of errors, the first being unable to open host file: machines. I looked online and found nothing helpful. I also tried changing the pathname, moving the "Pandemic-mpi" directory in the file manager, and replacing "machines" with an IP address and with various possible directories. None of them worked.
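One of the pathname variants I tried looked roughly like the sketch below: giving the host file by absolute path, since mpirun looks for machines relative to the current working directory. The exact path is a stand-in, and as the Jan 11 entry shows, the real fix turned out to be elsewhere:

  # point mpirun at the host file explicitly instead of relying on the working directory
  mpirun -machinefile /home/pi/Pandemic-mpi/machines -np 6 ./Pandemic-mpi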
Jan 10: I tried some possible solutions that Dr. Carl provided, but they didn't fix the problem.
Jan 11: Dr. Carl found that mpirun and mpiexec serve the same function. For running the pandemic simulation, Dr. Carl added a piece of code to the Pandemic.c file so that if the system doesn't have the X11 library, the program skips that part and continues. Moreover, if we didn't include -machinefile machines, the program ran perfectly, but as soon as we added that option, the same problem from two days ago happened again. Also, we noticed that the three RPis in the cluster couldn't really talk to each other. The reason might be that I didn't put the same PATH in the .bashrc file of the other two worker nodes, and that the IP addresses of the two worker nodes were missing from the machine_file on the master node. I fixed those and even made the directory names identical on all three RPis, but it still showed bash: /home/pi/mpich-install/bin/hydra_pmi_proxy: No such file or directory.
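A sketch of the PATH change on the workers, mirroring the line already used on the master node:

  # ~/.bashrc on each worker node: same MPICH path as the master
  export PATH=/home/pi/mpich-install/bin:$PATH

The machine_file fix was just adding one line per worker with its IP address, like the sketch under Dec 16.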
Jan 12: The same problem from yesterday persisted, and Dr. Carl also found it odd. I then started learning how to put a timer in the pandemic simulator.
Jan 13: I installed a new MPI on both worker nodes and made an "mpich-install" directory on both so they have the same path as the master node in the ".bashrc" file. This took the whole afternoon, so I will try to run them tomorrow.
Jan 14: I tested the cluster today and the three nodes still could not talk to each other. I found that the "mpich-install" directory on the master node contained four directories that the worker nodes didn't have: bin, include, lib, and share. If the workers had them, the paths would match.
Jan 15: Dr. Carl came in and tried to fix the problem, and it turned out that we didn't need to put the same pathname in the .bashrc file on the worker nodes; the master node uses its own path and looks for it on the worker nodes. The problem was that worker002 and worker003 were set up back in 2016, so their path layout was different from today's: they had rmpi instead of pi, and the bin directory was in the path or in an mpi directory for some reason. We then created a bin directory in mpich-install and copied everything over, which made the nodes talk to each other. Now I can try using multiple processor cores to run the pandemic simulation.
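Roughly what we did on each worker; the source path of the 2016 MPICH binaries differed between nodes, so it is a placeholder here:

  # on a worker node: create the bin directory the master expects and copy the MPICH
  # binaries into it, so that /home/pi/mpich-install/bin/hydra_pmi_proxy exists
  mkdir -p /home/pi/mpich-install/bin
  cp /path/to/old/mpich/bin/* /home/pi/mpich-install/bin/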
Jan 16: I tried using multiple processor cores to run the pandemic simulation, and it seemed that I needed to have the program on all the Pis, so I downloaded it on the other two worker nodes. On worker003, mpiexec and mpirun couldn't be found anymore, so I changed the path again to make them work; but when I tried to make the program, all it showed me was mpicc -o Pandemic.c-mpi Pandemic.c -DSHOW_RESULTS. Since worker002 could run the program, I tried using only master and worker002. However, when I ran the program with worker002, the terminal showed "Assertion failed in file /home/pi/mpich-install/mpich-3.3/src/mpid/ch3/channels/nemesis/src/ch3_progress.c at line 782: pkt->type >= 0 && pkt->type < MPIDI_CH3_PKT_END_ALL internal ABORT - process 3", which stopped me from going further.
Jan 18: We decided to move the pandemic simulation program to the hard drive, so I created a directory named "Project" in "/mnt/nfs" and moved "Pandemic-mpi" into it. I successfully ran the program but failed to connect to the worker nodes.
Jan 19: I noticed that the worker nodes were denied access by the server while mounting the master node's /mnt/nfs and realized that the IP addresses of the two worker nodes had changed again, so I updated the IP addresses in "/etc/exports". Following a suggestion I found online, I also added root_squash, which made the mounting on the worker nodes work. However, the "assertion" problem from last Saturday still happened.
Jan 20: It is strange that the IP addresses of the other two worker nodes kept changing, which sometimes made the mounting fail since /etc/exports didn't have the right IP addresses, so I put all the possible IP addresses in /etc/exports to save some time. Also, I found that worker003 couldn't talk back to master because it needed a password. After I fixed this by generating a key and sharing it with master, the pandemic simulation surprisingly worked! Now it can run with 4 cores from master and 4 cores from worker003, but neither master nor worker003 could work with worker002. We tried the same thing (generating a key and sharing it with master), but that wasn't the solution. We haven't figured out how to solve it yet.
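The passwordless-login fix was the standard SSH key exchange; a sketch with a placeholder master address and the pi user:

  # on worker003: generate a key pair (accept the defaults) and copy the public key to master
  ssh-keygen -t rsa
  ssh-copy-id pi@192.168.1.100

  # verify that worker003 can now reach master without a password prompt
  ssh pi@192.168.1.100 hostname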
Jan 21: I put MPI_Wtime() in Pandemic.c, and now the program shows the time each core takes to run and gives the result. We tested it a few times and it worked as expected. worker002 still couldn't work with master and worker003, so we decided to do the timing tests tomorrow and treat worker002 as an independent case.
Jan 22: Surprisingly, worker002 could work with master and worker003 this afternoon, so I started testing the program from 1 core to 12 cores. Here is the link to the Pandemic simulation; some notes and the time chart are in it. Since the program gave a different runtime every time I tested it, I calculated the average of three runs and put that in the chart.
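A sketch of how the sweep could be scripted rather than run by hand; the core counts and three runs per point match what this entry describes, but the host file name, program path, and log file are assumptions:

  #!/bin/bash
  # run the pandemic simulation with 1 to 12 cores, three runs each, and log the output
  for np in $(seq 1 12); do
      for run in 1 2 3; do
          echo "=== $np cores, run $run ===" >> timing.log
          mpiexec -f machine_file -n "$np" ./Pandemic-mpi >> timing.log 2>&1
      done
  done

Averaging the three MPI_Wtime results for each core count gives the numbers that went into the chart.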