pdc:mcgriff2017:progress
Date Log File
July 24 Professor Carl and I discussed the schedule and goals of our work. We opened a section for the 2017 McGriff summer research in DokuWiki and added a few subsections to it. We went over some basic MapReduce features and how MapReduce works. We also decided what to work on and where to begin. We began with Phoenix++, a C++ implementation of Google's MapReduce model for shared-memory multicore machines. We found a website that explains Phoenix++ from scratch and ran its word count program to understand it better. Then we did the exercises and made a table of results for different tests of this program. phoenix_exercises.xlsx
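To make the word count idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not the Phoenix++ program; the function names `map_words`, `reduce_counts`, and `word_count` are made up for illustration, and everything runs in one thread in memory.

```python
from collections import defaultdict

def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce phase: sum all the 1s emitted for a single word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle step: group the emitted values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            groups[word].append(one)
    return dict(reduce_counts(w, c) for w, c in groups.items())

if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

A real MapReduce runtime would run many map calls in parallel and do the grouping itself; only the two small functions are what the programmer normally writes.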
July 25 We mainly worked with Matplotlib, a Python plotting library. We had a few issues installing Matplotlib on one of the Linux machines in WL136 because the install needed sudo, so I used my laptop instead. First, using Phoenix++, I ran the histogram program under “time -p” to measure how long it took to run. I used the large.bmp image as input and varied MR_NUMTHREADS from 1 to 4. Then I took the real and user times and plotted two graphs: MR_NUMTHREADS vs. real time and MR_NUMTHREADS vs. user time.
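The timing data for these plots comes from the three lines that “time -p” prints (real, user, sys). A small sketch of pulling those numbers out of the text, assuming the output has been captured into a string; the helper name `parse_time_p` and the sample numbers are made up:

```python
def parse_time_p(text):
    """Return {'real': ..., 'user': ..., 'sys': ...} parsed from `time -p` output."""
    times = {}
    for line in text.splitlines():
        parts = line.split()
        # `time -p` prints lines such as "real 1.52"; ignore anything else.
        if len(parts) == 2 and parts[0] in ("real", "user", "sys"):
            times[parts[0]] = float(parts[1])
    return times

if __name__ == "__main__":
    sample = "real 1.52\nuser 2.87\nsys 0.04\n"  # made-up numbers, not a measurement
    print(parse_time_p(sample))
```

Collecting one such dictionary per MR_NUMTHREADS value gives the y-values for the real time and user time graphs.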
July 26 I started by plotting the Phoenix++ MapReduce test programs using Matplotlib. Just like yesterday, I plotted two graphs for each test file (MR_NUMTHREADS vs. real time and MR_NUMTHREADS vs. user time). I found a pattern: the real time a run takes decreases as MR_NUMTHREADS increases, but the user time increases. I looked up the difference between the two. Real time is all elapsed time, including time slices used by other processes and time the process spends blocked, while user time is only the CPU time actually spent executing the process. (https://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1) After plotting all the tests, I tried to fix the build errors in two tests: pca and kmeans. In addition, I tried Phoenix 2.0, but all of its tests had the same issue when I tried to build them.
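The difference between real time and CPU time can be seen in a few lines of Python. This is only a sketch: `measure` is a made-up helper, and `time.process_time()` counts user plus system CPU time for the current process, which is close to, but not the same as, the "user" line from the time command.

```python
import time

def measure(work):
    """Run work() and return (elapsed wall-clock seconds, CPU seconds used by this process)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

if __name__ == "__main__":
    # Sleeping is blocked time: it counts toward real time but uses almost no CPU.
    wall, cpu = measure(lambda: time.sleep(0.2))
    print(f"real={wall:.3f}s cpu={cpu:.3f}s")
```

The reverse case explains the pattern in the plots: CPU time is summed over all threads, so a run with several busy threads can report more user time than real time.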
July 27 Dr. Carl helped me fix the build issue with Phoenix 2.0. The problem was that I had the wrong library file in the lib folder, so I deleted it and ran make in the Phoenix 2.0 directory. After fixing that, I measured the real time and user time it takes to run Phoenix 2.0 and Phoenix++. I plotted the data and compared the 2.0 and ++ versions. The results show a pattern across the tests: except for linear regression, Phoenix++ takes much less time to run than Phoenix 2.0. Beyond that, no result clearly tells us which version is better.
July 28 I tried to install Anaconda3, but it was a bit complicated because I had to change the .bashrc file manually. After installing Anaconda3, I installed Jupyter and used Jupyter Notebook. Then I plotted the real time and user time of Linear Regression, and compared the speedup and efficiency of Phoenix 2.0 and Phoenix++ while varying the number of threads. I repeated the same thing with the String Match program.
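Speedup and efficiency here follow the usual definitions: speedup on p threads is the 1-thread time divided by the p-thread time, and efficiency is speedup divided by p. A minimal sketch, where the function name and the timings are hypothetical and not measurements from this project:

```python
def speedup_and_efficiency(times_by_threads):
    """Map thread count -> (speedup, efficiency), given thread count -> real seconds.

    The input must include an entry for 1 thread, which serves as the baseline.
    """
    t1 = times_by_threads[1]
    results = {}
    for p, tp in sorted(times_by_threads.items()):
        speedup = t1 / tp
        results[p] = (speedup, speedup / p)
    return results

if __name__ == "__main__":
    timings = {1: 8.0, 2: 4.0, 4: 2.5}  # hypothetical real times in seconds
    for p, (s, e) in speedup_and_efficiency(timings).items():
        print(f"{p} threads: speedup={s:.2f} efficiency={e:.2f}")
```

An efficiency of 1.0 means perfect scaling; values below 1.0 show how much of each added thread is lost to overhead.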
July 31 I finished last week's work: plotting real time and user time and comparing the speedup and efficiency of Phoenix 2.0 and Phoenix++ while varying the number of threads. The first one I did was Matrix Multiply. Then I did Matrix Multiply.
August 1 I installed Theano, a Python library for defining, optimizing, and evaluating mathematical expressions on multi-dimensional arrays. Then we discussed what kind of data we are going to use, and contacted the Landscape Analysis Lab to ask for public student-developed datasets. Meanwhile, I had issues installing TensorFlow and Python 3, so I had to reinstall Ubuntu on my laptop.
August 2 I reinstalled Ubuntu on my laptop, and then I had to reinstall various software such as Python 3, Anaconda, Phoenix, TensorFlow, Theano, Matplotlib, etc. Then I read a few TensorFlow tutorials.
August 3 I learned how to write a linear regression program from scratch using TensorFlow. It helped me understand how TensorFlow and linear regression work, which also relates to understanding deep learning and machine learning. I posted my source code here. Then I added a Phoenix 2.0 page on DokuWiki with a brief explanation and source code. This page also includes the error messages from both kmeans and pca. Phoenix 2.0 Description http://hive.sewanee.edu/kimj0/Tensorflow http://hive.sewanee.edu/kimj0/Python/
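The idea behind the linear regression program can be sketched without TensorFlow. The version below is plain Python, not the posted source code: it fits y = w*x + b by gradient descent on mean squared error, and the function name, learning rate, step count, and data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]  # synthetic data on the line y = 2x + 1
    w, b = fit_line(xs, ys)
    print(f"w={w:.3f} b={b:.3f}")
```

TensorFlow computes the two gradients automatically; writing them by hand here shows what the optimizer is doing on each step.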
August 4 I realized that kmeans and pca for Phoenix 2.0 do work, so I plotted them. Then I looked for large datasets that can be downloaded and used for MapReduce. Here are some good dataset links: http://data.worldbank.org/ https://github.com/datacarpentry https://www.dataquest.io/blog/free-datasets-for-projects/ https://archive.ics.uci.edu/ml/datasets.html The next thing to consider is what questions MapReduce can be used to answer.
August 7 I printed the Phoenix 2.0 and Phoenix++ test programs and compared their source code to understand MapReduce and how it is used for different purposes.
August 8 I prepared input files as described in the WMR Flight Data activity. To do this, I downloaded flight data from the Bureau of Transportation Statistics. To test the files, I tried to write a program that uses the identity mapper and reducer on the data for each year.
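An identity mapper and reducer pass every record through unchanged, which makes them a handy check that the input files are read and grouped correctly. A minimal in-memory sketch; the driver `run_job` and the sample records are made up and do not reproduce the WMR interface:

```python
from collections import defaultdict

def identity_mapper(key, value):
    """Emit the input pair unchanged."""
    yield key, value

def identity_reducer(key, values):
    """Emit every value for the key unchanged."""
    for value in values:
        yield key, value

def run_job(pairs, mapper, reducer):
    # Group mapper output by key, then reduce the keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output

if __name__ == "__main__":
    records = [("2013", "row b"), ("2012", "row a"), ("2013", "row c")]  # made-up records
    print(run_job(records, identity_mapper, identity_reducer))
```

The output contains the same records as the input, only regrouped by key, so any missing or mangled record points to a problem in the input preparation.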
August 9 Professor Carl and I went to Professor Van de Ven's office to discuss what data the Environment Department has and how to move it to our computers. I finished writing a program, in C++, that reads a CSV file and prints out all the data. Hive Link
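For comparison with the C++ program, the same read-and-print task can be sketched with Python's standard csv module. The helper name `read_rows`, the column names, and the sample rows are assumptions for illustration, not the real flight data layout.

```python
import csv
import io

def read_rows(csv_file):
    """Return (header, rows) from an open CSV file object."""
    reader = csv.reader(csv_file)
    header = next(reader)
    return header, list(reader)

if __name__ == "__main__":
    # Made-up sample standing in for a real file;
    # for a file on disk, pass open(path, newline="") instead.
    sample = io.StringIO("YEAR,CARRIER,ARR_DELAY\n2013,AA,12\n2013,DL,-5\n")
    header, rows = read_rows(sample)
    print(header)
    for row in rows:
        print(row)
```

Every field comes back as a string, so numeric columns need an explicit conversion before any arithmetic.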
August 10 I downloaded an extra year of flight data (2013), and we moved it into Professor Carl's hive folder. Then I tried to write a Phoenix++ test program that reads the flight data and prints it out. However, I am having a few issues, which I hope to fix tomorrow.
August 14-17 Still working on a MapReduce test program that reads the flight data and prints it out.
August 18 I wrote a test program that reads the flight data and prints it out. Download CSVReadfile
August 21 I did most of what Professor Carl put on the TODO list. The program now reads the whole file but prints out only a given number of records. It also calculates the average delay time over the whole dataset. I am still working on the delay calculation restricted to positive-only or negative-only values. Download CSVReadfile
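The average delay, and the positive-only and negative-only variants, can be sketched in a few lines of Python. This is not the CSVReadfile program: the function name, the `ARR_DELAY` column name, and the sample rows are assumptions, and rows with an empty delay field are simply skipped.

```python
import csv
import io

def average_delay(csv_file, column="ARR_DELAY", keep=lambda d: True):
    """Average the delay column over rows whose delay passes the keep() filter."""
    delays = []
    for row in csv.DictReader(csv_file):
        text = (row.get(column) or "").strip()
        if not text:
            continue  # skip rows with a missing delay
        delay = float(text)
        if keep(delay):
            delays.append(delay)
    return sum(delays) / len(delays) if delays else None

if __name__ == "__main__":
    data = "CARRIER,ARR_DELAY\nAA,10\nDL,-4\nUA,\nWN,6\n"  # made-up rows
    print(average_delay(io.StringIO(data)))                        # all delays
    print(average_delay(io.StringIO(data), keep=lambda d: d > 0))  # positive delays only
    print(average_delay(io.StringIO(data), keep=lambda d: d < 0))  # negative delays only
```

Passing the filter as a function keeps one code path for all three averages, and the function returns None rather than dividing by zero when no row passes the filter.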
August 22-24 Professor Carl and I wrote a MapReduce program, using Phoenix++, that reads the CSV flight data file and prints it out. We also put it on Bitbucket. Bitbucket Link
pdc/mcgriff2017/progress.txt · Last modified: 2017/08/28 11:37 by kimj0