Hello everyone. Today we're going to talk about your project proposal. We're now getting ready to propose our own data mining projects. Specifically, you need to identify the key components to include in your proposal. The main question, of course, is: what do we do to prepare the project proposal? Your project proposal is a way to describe what you plan to do in your data mining project. Of course, we know that at the proposal stage we don't have all the information and we don't have all the details. We may not know exactly what we're going to find and how things may turn out. So it's totally fine if you don't; you're not actually expected to have all the information available. Because of that, your proposal is a plan. It's tentative, and it's definitely okay to change or update your proposal later. So don't fear that you're committing to your proposal and have to follow exactly what you are planning right now. That's actually the nice part about exploration in data analytics. Many times you start by asking some interesting questions, but you don't necessarily know what you're going to find. Things can change. It is analytical thinking and reasoning that really guide us in this process. Anyway, at this stage, when writing the proposal, you come up with a good idea and a good plan for carrying it out, knowing that not all the information is available and that things can change.
Of course, now the big question is: we need to submit our proposal, so what should be included in a project proposal? Think about it. If I ask you to give me a proposal for your data mining project, what are the key components you think you should include? Let's start with the first one. That is your project title. It's very important to have a good project title, because many times I have seen people just say, "Data mining project."
Of course, that is a data mining project for every student, or anybody. But it's really not helpful or informative in any way. You really want to pick a good project title. It should be concise, meaning you don't want a very long title that people can easily forget. A concise title is easy to remember, but you also want to make the title informative. The title should give people a good idea of what you're planning to do. Of course, it's just the title, so it cannot carry everything. A starting point is to think about the keywords in terms of maybe the application, the problem, the technique, or the data you're using. See what the most important keywords are and whether you can embed those in your project title. It's optional, but I have also seen people choose good acronyms or short names for their projects, which is a good way to have a short but informative name that people can remember your project by. Do spend a good amount of time really thinking about what project title you should use. Of course you can change it, but try to come up with a reasonably good one.
The next part is usually referred to as the abstract in a more academic setting, and many times you may have heard people use the term executive summary. It's really just intended as a summary, a brief description of what your project is about. Typically you may have a much longer report, but you want to give people a quick synopsis of your project. Here you typically use one or two paragraphs to give a good, brief overview of what your project is about. This is a very short overview.
Now we get into the more detailed components of your proposal. The first one is typically the introduction, which could be a section in your report. What is the introduction about?
You basically want to introduce your project. What do you need to introduce? First, what is the problem? Of course you'll say, "I'm working on a data mining project." But what is the problem you are trying to tackle? Just start by describing the problem. The next part is why it is important. This is a problem, but why is this an important problem to solve? You're talking about spending your time and effort doing your data mining project on this particular problem, so you want to at least explain to some extent why the problem is important to solve.
It is also important to talk a little about what has been done already, because many times the problem you're trying to address is not brand new. Either the exact problem has been tackled by other people with some solutions already, or you are working on maybe not the exact same problem but a related one that people have addressed. It is important for you to have a reasonably good understanding of what has been done already, because we really don't want to just repeat whatever other people have already done. So you need to know what has been addressed already, the existing solutions, but it is also important to explain why there's still a gap: given what people have done already, they didn't fully address this problem. This is where you briefly talk about the limitations, or the gap, given existing solutions. That then sets the stage for why you're working on this problem and why there's a potential contribution in your work. This is important: you want to be working on a project where you're actually contributing something new. Again, this is setting the stage, so really think about the problem setting, the value of solving that problem, the limitations, and the potential contribution.
Try to think about it in this way, because that really just says: this is important to address, and I think I have a way to do it better. Let me use an example from my research projects, some of which I described in the previous lecture. Remember my plug-in hybrid electric vehicle problem setting. Here, the problem is that we want to understand how the vehicle, in particular the battery system, performs under different driving behaviors. It is an important problem because by understanding how the system performs, we can come up with a better solution or a better design, and of course we'll have a better way of estimating the capacity of the battery system over a longer time. So it gives us this benefit if we can do it. Then there are the limitations of prior solutions. When we worked on that project, there was really no good solution that took a data-driven approach to connect people's driving behavior with battery system performance. There were various kinds of physical models of the battery system, trying to understand how the system would be expected to perform under some specific scenarios, but they did not really connect to real-world usage. From our perspective, the potential contribution was that by collecting real-world data about users' driving behavior and the vehicles' battery system performance, we would be able to build a much more realistic model of how the battery system performs, and from there provide more precise estimates of performance and also provide directions for improvement.
That's just one example. Think about your own project and how you want to provide that reasoning: why is this a problem, and why is it something interesting to work on? We have talked about the introduction, so in the next section we usually go a little bit deeper and talk about the related work.
In the previous slide, the point was that you want to talk about existing solutions, but just briefly, because your goal there is really more about highlighting the limitations of the existing work. Here you dive in a little bit further, so you can talk more specifically about what has been done already. This is making sure that you have a good understanding of the related work. When we say related work, we mean: this is a project I would like to work on, but what other related work has been done already? Typically, depending on your project, you can group your discussion of the related work into maybe two or three categories. These could be about specific topics, or they could be more about the datasets, the tools, or previous studies and their findings. The key idea is that you want to identify the ones that are relevant, group them in some way, and then describe how they are related and how they are different. The key point is that you want to be able to show: I have good knowledge of what has happened already in this field and how my work will build upon it. That is important because our goal is not to reinvent the wheel. We actually want to leverage whatever has been done already. There may be a previous tool or method that works reasonably well, but I see a way to make it better. Or there may be previous studies that looked at one perspective but missed other ones that are more interesting, so you could be adding new knowledge. Think about how your project builds upon the prior work in terms of utilizing what already exists, such as a nice tool or a dataset you can use, but also think about how you may be able to compare across different methods for your particular usage scenario. Of course, you can look for particular improvements, and there may also be new knowledge you can capture given what has been done already.
Of course, this really depends on your actual work, and you can expand your related work as you move along in your project. You may start by identifying a few pieces, but as you go deeper you may find other relevant ones, which is totally okay. As we said, you won't have all the information at the proposal stage. It's actually good for you to update this section as you find more information. But the key point when we talk about related work is understanding what has been done and how your work improves upon it.
We talked about the introduction. We talked about related work. The next one, of course, is really the core part: the proposed work. What are you proposing to do? You have said what the problem is and what has been done already, but what are you going to tackle in your particular project? Typically, especially with a data mining project, you start by thinking about the datasets and the tools you can use. The key here is to include a good description: What kind of data can you use? What kind of information is included in the dataset? Are there existing tools with certain functionality you can readily utilize? This again helps you plan your work, so you don't have to spend all your effort just creating the dataset or building a tool, which would be too much work. Leverage what is available, and clearly identify it. Then you get to the part about your main tasks. As we mentioned earlier, it's typically a good idea to plan your project using a prioritized list of tasks. The idea is that there may be different things I can explore, and I'm not sure whether I can explore all of them. I may add other ones as I progress. These are all fine, but always start with an initial list of tasks and think about how you want to carry them out. This actually connects back to our discussion of the data mining pipeline.
As I said, you have a problem setting, and you want to accomplish something. Think about the whole pipeline, and think about what you would do or can do in each of those steps. Start with the raw data and the data understanding stage. It's very helpful, and actually really needed, for you to take your time and understand your data. You can do statistical analysis to understand the various kinds of distributions and whether there are any extreme values. Also, visualize your data so you have a much better understanding of the data you're working with. That's the starting point. Then you can move on and think about whether you need to do certain types of preprocessing, and whether there's a particular warehousing or data management mechanism you can leverage. Many times, if you spend a little bit of time here preparing your data or managing your data properly, it will make your analysis much easier later. Then, of course, you get to the core part, the data modeling. Depending on your problem, you may be asking specific questions. Thinking again in general terms about the techniques we have talked about, this could be frequent pattern analysis, or it could be a classification problem you're trying to address. You may want to use an unsupervised, clustering-based approach, or you may be focusing more on anomalies, or this could be more of a temporal analysis, or image, text, or graph analysis. Those are all specific things you could do. Once you have a general list of the things you'd like to explore, put them down and have a reasonable discussion about how you would like to carry them out. Again, keep in mind that you may not have all the concrete details right now. You may have something high-level, saying: this is a classification problem, and these are the methods I would like to explore. Good.
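Going back to the data understanding step mentioned above, here is a minimal sketch of what a first statistical look at one numeric column might involve. It uses only Python's standard `statistics` module. The `trip_km` values are made up for illustration, and the 1.5 × IQR rule for flagging extreme values is one common convention, not something the lecture prescribes.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [q1 - k*IQR, q3 + k*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical trip distances in km; 310.0 is a deliberately extreme value.
trip_km = [12.0, 15.5, 9.8, 14.2, 11.1, 13.7, 10.4, 310.0, 12.9, 16.3]

stats = summarize(trip_km)
print(stats)
print("possible extreme values:", iqr_outliers(trip_km))
```

Comparing the mean with the median in the printed summary is a quick way to see how strongly a single extreme value pulls on the distribution, which is the kind of observation worth noting in a proposal's data description.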
Or you may say, "Okay, I'm not exactly sure, but there are some questions I'd like to explore, and I'll see whether I can answer them." All of those are okay. This is really just your planning stage. Don't be concerned about, "Oh, I don't know exactly what I'm going to do here." But do ask some interesting questions. Because this is your planning stage, you are the architect. You figure out what would be some interesting questions to explore. You may or may not find a good answer, or it may not turn out to be an interesting pattern. That is fine, but asking the question at this stage is important.
All right. You have proposed the things you would like to do. But also think about how you're going to evaluate your results. As we said, with a data mining project you don't just say, "Okay, I did this analysis, these are the results, here you go." Somebody has to take your results and use them, so you need to show how good your results are, how reliable they are, how accurate your model is. You need to think specifically about how you're going to carry out the evaluation process. The starting point is: for your problem setting, what kinds of evaluation metrics are important? Generally, think about effectiveness measures. If it's classification, you'll talk about accuracy. If you're talking about frequent patterns, you have the support value, the confidence, and the correlation, all values you can use to show how significant a pattern is. Remember that we also talked about false positives and false negatives and how they may differ: one may be more important than the other. Many times it is not a single metric. You may be using multiple metrics to evaluate different perspectives.
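To make the effectiveness measures above concrete, here is a small sketch in plain Python. The function names, labels, and transactions are all hypothetical. Lift is used as the correlation measure purely as an assumption, since the lecture does not say which correlation measure is meant.

```python
def classification_metrics(actual, predicted, positive=1):
    """Count outcomes for a binary classifier and derive common metrics."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_c = sum(1 for t in transactions if c <= set(t))
    n_both = sum(1 for t in transactions if (a | c) <= set(t))
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    # Lift = confidence / support(consequent); assumed correlation measure.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

Reporting the false positive and false negative counts separately, instead of folding everything into accuracy, lets you argue in the proposal which kind of error matters more for your problem.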
Another thing I really want to emphasize is that when you consider your evaluation metrics, think about both the effectiveness measures and the efficiency angle. Efficiency can be how long it takes to carry out each of those steps. This is more like the latency, or the throughput: how many data points can you handle in a certain time frame? Those are all important, especially if you are, say, comparing different methods, or checking whether this would actually be useful or feasible in a real-world setting. So the efficiency angle can play an important role when you're talking about potential trade-offs between different approaches.
Once you have the evaluation metrics, you then ask: how am I going to carry out the evaluation? This gets to the experimental setup. What kind of data are you going to use? Are you going to split your data into different subsets? Are you going to use a certain portion as your training set and the rest as a testing set? We talked about cross-validation, k-fold cross-validation, and significance tests. All of these are important. Think about exactly how you're going to compare. Again, keep in mind that you may not have everything figured out. This is the proposal stage. But you need to start thinking about it and write down what you have already, and then of course you can refine it later as you move along in your project. Another angle I want to point out is that many times in your evaluation you're comparing different approaches. For example, to solve a particular problem, there may be a few different ways to do it. There may be some existing solutions, and then maybe your own solution as well. You may have a few different choices. All of those are things you can compare.
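As one possible shape for the experimental setup described above, the sketch below runs k-fold cross-validation for two methods and reports both an effectiveness measure (mean accuracy) and an efficiency measure (wall-clock time and points per second). Everything here is an illustrative assumption: the synthetic two-cluster data, the majority-class baseline, and the simple 1-nearest-neighbor classifier stand in for whatever dataset and methods your own project compares.

```python
import random
import time


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_baseline(train_x, train_y, test_x):
    """Predict the most common training label for every test point."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common for _ in test_x]


def one_nearest_neighbor(train_x, train_y, test_x):
    """Predict the label of the closest training point (squared distance)."""
    preds = []
    for x in test_x:
        best = min(
            range(len(train_x)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
        )
        preds.append(train_y[best])
    return preds


def cross_validate(method, xs, ys, k=5, seed=0):
    """Return (mean accuracy, elapsed seconds) for one method over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    start = time.perf_counter()
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        preds = method(
            [xs[i] for i in train],
            [ys[i] for i in train],
            [xs[i] for i in held_out],
        )
        correct = sum(1 for i, p in zip(held_out, preds) if ys[i] == p)
        accuracies.append(correct / len(held_out))
    elapsed = time.perf_counter() - start
    return sum(accuracies) / len(folds), elapsed


# Synthetic data: two well-separated 2-D clusters, 50 points per class.
rng = random.Random(42)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
xs += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
ys = [0] * 50 + [1] * 50

for name, method in [
    ("majority baseline", majority_baseline),
    ("1-nearest neighbor", one_nearest_neighbor),
]:
    acc, secs = cross_validate(method, xs, ys, k=5)
    rate = len(xs) / secs if secs > 0 else float("inf")
    print(f"{name}: accuracy={acc:.2f}, time={secs:.4f}s, points/s={rate:.0f}")
```

Using the same folds (same `seed`) for every method keeps the comparison fair, and printing time next to accuracy makes the effectiveness versus efficiency trade-off visible in one table.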
When you compare the different methods against the evaluation metrics, you get a much better sense of how one solution compares to another and whether it performs better in this case than the others. Of course, at this stage you don't have the actual evaluation yet. You are really just planning how you want to evaluate and how you will claim success or good performance when you talk about your end results or your solutions.
The next section may be a little bit confusing at the beginning. I really mean it as a discussion section. It pushes you a bit further along the analytical reasoning direction. Generally, think about how you want to, one, plan out your project, because this is really about the project timeline. During our [inaudible] week term, our schedules are different and everyone works differently, so make sure you plan things out ahead of time so you know, generally, what the tasks are and when you expect to finish what. Is this actually realistic? Is it maybe too much, or can you actually do a little bit more? This is really just to help you plan how you will carry out the work. You can also talk a little about your current status, which will be updated later as you make progress in your project. In your current status you can say, "I have already obtained the dataset," or "I have done some initial reading on this," or "I already have this tool working," or something like that. It's a summary for yourself, but also a way of reporting how you plan to carry out the project. Something else you can include is potential challenges. You have a plan, and this is what you would like to do, but you may foresee some particular challenges or roadblocks.
Say there is a piece I would like to do, but there's some chance that it may not work out. Or I would like to ask a particular question, but it may not turn out to be very useful or easy. You can talk about this. Again, if you don't have anything to talk about here, don't worry, but this is really a place where you can reason: this is what I plan to do, this is what I propose to do, but what are the potential scenarios that may redirect it or require certain changes? This also gets to the notion of thinking about alternative approaches, or whether you have a backup plan if something doesn't work out. You may not need this at all, but it's generally good to think about, because sometimes you say, "Okay, I'd like to look at those questions or use this dataset," but then maybe it turns out that you can't really use this dataset, or it doesn't have the information you need. In that case, my backup plan may be to augment it with another dataset that I could combine with it. Or I may say that if this dataset or this tool doesn't work out, I have another option I could use. This is just planning, and thinking about how you may be able to handle or address potential challenges down the road.
Generally, for any proposal or report document, you start with a good overview in the abstract and the intro, and then you talk about the details, but it's always good to have a conclusion or a concluding section that ties everything back together. It's really meant as a project summary. When you're writing it, you may feel it is a little bit repetitive of what you have said already, but keep in mind that it is very helpful for your audience or your reader to see the big picture, and it is always good to have a summary both at the beginning and at the end.
There are also two particular pieces that you may want to include as you write your conclusion. These of course come when you finish your project. This is a proposal, so you don't have them yet. But typically, when you finish your project and write your final report, it's important for your conclusion to give the summary and then really highlight your key findings, because people will be paying attention to that: they know you're now summarizing and concluding your report. What are the most important findings in your work? It's also always interesting to think about future work. Of course, the plan is that by the time you finish your project, you're done. You're not obligated to do more. But this future work discussion is very important, because your work on the project doesn't really stop there. You could say: this is what I have accomplished so far, but if I were to carry this further, or if somebody else were to carry this further, there are some interesting questions to explore. This is very good thinking, again using the architect's view. You say: this is what I would like to do, but by the time we finish, there are also further things we can explore. Keep all that in mind as you're writing your conclusion section.
Those are really the main pieces that should be included in your proposal. Let's look at some of the logistics. As you plan your proposal, think about the specific details you want to include in your proposal submission. There are two specific things we ask you to submit. One is the proposal slides. These are viewed more like presentation slides. We're not going to do presentations, but slides are a very nice way for you to quickly capture your key ideas. The other one, of course, is the proposal report. You will see that there may be some overlap, some redundant information captured in both the slides and the report.
But it is important to note that these two are really used for different purposes, and they are presented differently. Let's start with the proposal slides. These, of course, are intended as a presentation. Suppose you are to give a quick presentation or overview, maybe to your supervisor or your sponsor, basically to convince them that you're working on an interesting problem and there's value in your work. In this case, what you're presenting is a set of slides, typically 5-10 slides, giving a good summary of your proposal. Because it's a summary, it's really not intended to be extensive. You would include a good project title. As I said, your title is important. People will remember your project if you have a good title. Then you get to some more specifics. You may have one slide for your problem statement: as we said, what the problem is, why it's important, and why there's a potential contribution. Then you may have one slide to talk about related work and show that there's a gap for you to do further work. Then you may have one slide or two, depending on the specific components, where you say: this is what I'm planning to do. Talk about the datasets and the tools, but more concretely, what are the key questions or tasks you're trying to perform? How are you going to evaluate? Also maybe one slide on the timeline. As you can see, it's roughly the same structure as what we talked about earlier. But because this is intended as a brief presentation, keep it brief. Your goal is not to give all the information, show all the technical details, and say in a very detailed sense what you're going to do. Rather, it's a good overview to convince people. It's more like a marketing presentation, where you say, "Okay, I'm going to do something interesting, there's value in it, and this is how I'm going to do it."
I want to say something very briefly about the style of your presentation slides. I know some of you, or most of you, may have given presentations in very different settings, so you're probably experts already. But I think it's very useful to always think about how your slides look when you're presenting them to people. Think about presentations or slides you have used before, or think about when you were seeing other people's presentations and slides: what you liked, what you didn't like, what you think could be better. There are a few key points I want to highlight. First, keep your slides clean. You don't want busy slides, because every time you put up a slide, it draws attention from the audience. You want to have good information there, but you don't want to confuse or distract people with too much information. Think about what key information you need to put on your slides. Also, use larger fonts. That is important. When we are the ones presenting, we know all the details, and we have a tendency to put all the details in because we want to make sure we convey them. But that is typically just too much for the audience. You want to pick the right level of information and the right level of summary, so it's not too detailed. Using larger fonts ensures that you don't put too much text on a single slide. It's also easier for the audience to read, and it's much more effective in communicating your key ideas. Also, think about effective ways of using color or boldface. Boldface is useful for highlighting keywords or key phrases. With effective use of color, you can really draw people's attention to the most important part.
That's very helpful, but be careful not to use too many colors, because that again gets too busy. Also, some colors just don't show well on a slide. They may be too bright or too dark, so make sure you pick a few colors that you're comfortable with, and use a few but not too many. Also, pictures. Many times, of course, a lot of the core content will be presented as text, but pictures are very useful, especially in slides. When you're presenting, if you include an illustration or a picture, it is much more effective and conveys the message much better. You may have heard the saying that a picture is worth a thousand words. It is true: you really can communicate good information effectively. Always think about how you can incorporate pictures in your presentation slides to communicate your ideas effectively. Those are the proposal slides. As I said, the slides are really intended as a good high-level overview. Make sure you include the key ideas and the core pieces, but avoid putting too much content in your slides.
Then the other piece is the proposal report. This is the actual document you're going to write, and you'll provide more details in it. First, for the report format itself, I'm asking you to use the ACM proceedings template. The link is provided here, and you can check it out. Some of you may have used it already, and some of you have never used it. Just check it out. Basically, depending on how comfortable you are, there are three different options. Make sure you use the two-column format. There are different formats; for our course, please make sure you pick the two-column one. If you use Word regularly, that's probably the easy option. You can go there, and there's a section for Word documents. You can download that template, and they have examples. You can just pick that and put in your content. That's one way to do it.
But if you want to try, or you are familiar with, LaTeX or Overleaf, that is a different typesetting system, and it is widely used for a lot of ACM technical papers. Check it out. It's also reasonably easy to get started. See whether it's something you are comfortable with or something you'd like to learn. It is actually very useful. LaTeX you can run on your desktop computer, while Overleaf is online. It's also good for sharing later on, if you want to use it: it's basically web-based editing. You can edit directly there and see the corresponding PDF file. Again, to get the two-column format, make sure you use the sigconf template. That's basically the ACM special interest group conference proceedings template. That's the generic one. Just pick that.
That is the template, or the format, of the report. What do you include in your report? As we talked about before, include your title and then all the sections we discussed earlier, starting with your abstract, introduction, related work, and proposed work, which can be broken down into multiple subsections depending on how you want to organize it. Then you will have the evaluation section, the discussion section, and the conclusion. The references section is for your citations, for example when you talk about related work. You can cite a paper, an online article, or maybe a tool you're using. All of those are references you can add. Just make sure you include them, so you're pointing to the other resources you are leveraging in your project. In Course 1, we talked about KDD, the Knowledge Discovery and Data Mining conference, which is the main research community for data mining. You should have looked at some of the KDD papers, but you can definitely go back and check out some of them. That will give you a good sense of the general formatting and the flow of their work.
Keep in mind that those are research papers, so don't worry about the technical details. Just read them to understand how the story flows in the report. With that, I hope you now have a good understanding and are ready to write your project proposal. We will continue next time, when we'll talk about how to go from the proposal stage, make progress to the checkpoint, and get to the final finish line. That's all for today. Thank you.